Support FASTA + QUAL (not just for colour space)

alexstaj / cutadapt

Automatically exported from code.google.com/p/cutadapt

Support FASTA + QUAL (not just for colour space) #6

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

Hi,

The source code comment at the start of the script says:

If two file names are given, they are assumed to be
.csfasta and .qual files as produced by the SOLiD sequencer.
(You still need to provide the -c option to correctly deal
with color space.)

It would be useful to also support sequence-space FASTA + QUAL. This is most 
commonly found as the output from Roche 454, since the manufacturer's software 
converts binary SFF files to FASTA + QUAL (and at the time of writing does not 
offer SFF to FASTQ conversion).

If cutadapt already copes with this, then the description quoted above needs 
to be updated.

Original issue reported on code.google.com by p.j.a.c...@googlemail.com on 8 Feb 2011 at 10:35
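
To make the request concrete, here is a minimal sketch of reading a sequence-space FASTA file together with its QUAL file. It is not cutadapt's implementation: the function names `read_fasta_like` and `read_fasta_qual` and the demo records are made up for illustration, and it assumes both files list the same reads in the same order, with QUAL values as whitespace-separated integers.

```python
import io


def read_fasta_like(handle):
    """Yield (name, lines) for each '>' record in a FASTA-style file."""
    name, lines = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, lines
            # Keep only the identifier and drop any description after it.
            name, lines = line[1:].split()[0], []
        else:
            lines.append(line)
    if name is not None:
        yield name, lines


def read_fasta_qual(fasta_handle, qual_handle):
    """Yield (name, sequence, qualities) from a FASTA file and its QUAL file.

    Assumption: records appear in the same order in both files. zip() stops
    at the shorter file, so a truncated file is not detected here.
    """
    pairs = zip(read_fasta_like(fasta_handle), read_fasta_like(qual_handle))
    for (name, seq_lines), (qual_name, qual_lines) in pairs:
        if name != qual_name:
            raise ValueError(f"record mismatch: {name!r} vs {qual_name!r}")
        sequence = "".join(seq_lines)
        qualities = [int(q) for q in " ".join(qual_lines).split()]
        if len(sequence) != len(qualities):
            raise ValueError(f"length mismatch in record {name!r}")
        yield name, sequence, qualities


# Made-up demo data, not taken from the Roche example files.
demo_fasta = io.StringIO(">read1 length=4\nAC\nGT\n>read2\nTTA\n")
demo_qual = io.StringIO(">read1 length=4\n30 31\n32 33\n>read2\n20 21 22\n")
demo_records = list(read_fasta_qual(demo_fasta, demo_qual))
print(demo_records)
```

Checking names and lengths record by record is a deliberate choice: a FASTA and QUAL file that have drifted out of sync would otherwise silently pair bases with the wrong quality values.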

GoogleCodeExporter commented 9 years ago

Could you perhaps send me or attach to this issue an example of a FASTA and a 
QUAL file as produced by the Roche software?

Original comment by marcel.m...@tu-dortmund.de on 8 Feb 2011 at 11:19

GoogleCodeExporter commented 9 years ago

For a short example (10 reads), see:

http://biopython.open-bio.org/SRC/biopython/Tests/Roche/
or
https://github.com/biopython/biopython/tree/master/Tests/Roche

Original SFF file:

E3MFGYR02_random_10_reads.sff

The trimmed reads (what people normally work with):

E3MFGYR02_random_10_reads.fasta
E3MFGYR02_random_10_reads.qual

The untrimmed reads (with 454 adapter and poor quality bases):

E3MFGYR02_random_10_reads_no_trim.fasta
E3MFGYR02_random_10_reads_no_trim.qual

Original comment by p.j.a.c...@googlemail.com on 8 Feb 2011 at 12:07

GoogleCodeExporter commented 9 years ago

Thanks, I'll look into this. May take a few days.

Original comment by marcel.m...@tu-dortmund.de on 8 Feb 2011 at 12:44

Changed title: Support FASTA + QUAL (not just for colour space)

GoogleCodeExporter commented 9 years ago

I have added support for non-colorspace .FASTA+.QUAL files to cutadapt. It 
seems to work, although the output differs from Biopython's trimmed 
sequences. This seems to be due to a different low-quality trimming algorithm. 
Remember to use the -b parameter to search for adapters that may be at the 
beginning of reads. If you don't, then all reads in which an adapter was 
found will be empty after trimming.
Can you get cutadapt from Subversion in order to test this? Otherwise I'll just 
release a new version.

Original comment by marcel.m...@tu-dortmund.de on 14 Feb 2011 at 2:58

Changed state: Fixed
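
The remark about empty reads can be illustrated with a simplified model. The sketch below uses exact string matching only and is not cutadapt's alignment code; the function names, the adapter, and the read are made up. It assumes that treating an adapter as a 3' adapter removes the adapter and everything after it, while allowing it at the beginning removes only the adapter and keeps what follows.

```python
def trim_as_3prime(read, adapter):
    """Treat the adapter as a 3' adapter: drop it and everything after it."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


def trim_anywhere(read, adapter):
    """Also allow the adapter at the start, keeping the bases that follow it."""
    if read.startswith(adapter):
        return read[len(adapter):]
    return trim_as_3prime(read, adapter)


# Hypothetical adapter and read; the read begins with the adapter.
adapter = "TCAG"
read = "TCAGGGATTACA"

only_3prime = trim_as_3prime(read, adapter)
anywhere = trim_anywhere(read, adapter)
print(repr(only_3prime))  # '' (the whole read is removed)
print(repr(anywhere))     # 'GGATTACA'
```

When the adapter sits at position 0, the 3' rule leaves nothing in front of it, which is why such reads come out empty.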

GoogleCodeExporter commented 9 years ago

The trimmed examples in the Biopython tests just apply the trimming 
information stored in the SFF file itself (just as the Roche off-instrument 
application does).

Original comment by p.j.a.c...@googlemail.com on 14 Feb 2011 at 4:01

GoogleCodeExporter commented 9 years ago

Thanks, that also explains where the lowercase nucleotides in the untrimmed 
files come from.

Original comment by marcel.m...@tu-dortmund.de on 15 Feb 2011 at 6:46
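
As a small illustration of the last two comments, the sketch below recovers a trimmed read from an untrimmed one. It assumes that lowercase bases at either end mark the regions the SFF trimming information would clip and that the kept region is uppercase; the function name `clip_lowercase` and the example read are hypothetical.

```python
def clip_lowercase(sequence, qualities):
    """Strip lowercase (clipped) flanks from a read and its quality values."""
    start = 0
    while start < len(sequence) and sequence[start].islower():
        start += 1
    end = len(sequence)
    while end > start and sequence[end - 1].islower():
        end -= 1
    return sequence[start:end], qualities[start:end]


# Made-up read: lowercase flanks stand in for adapter and low-quality bases.
untrimmed = "tcagGATTACAnn"
quals = list(range(len(untrimmed)))
trimmed_seq, trimmed_quals = clip_lowercase(untrimmed, quals)
print(trimmed_seq)    # GATTACA
print(trimmed_quals)  # [4, 5, 6, 7, 8, 9, 10]
```

Slicing the sequence and the quality list with the same indices keeps each base paired with its own quality value.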