test data files (fasta and fastq) with @ instead of >

bioconvert / bioconvert

Bioconvert is a collaborative project to facilitate the interconversion of life science data from one format to another.

http://bioconvert.readthedocs.io

GNU General Public License v3.0

test data files (fasta and fastq) with @ instead of > #68

Closed. cokelaer closed this issue 6 years ago.

cokelaer commented 6 years ago

@blaiseli

You mentioned:

"Regarding the md5 sum for the test fasta file, we shouldn't have a fasta file where headers are introduced by "@" instead of ">"."

Interestingly, biopython, seqtk and gatb currently handle this old format. It is not a standard format, but it may be produced by old sequencers. I would suggest keeping it for now and adding the same data set with "@" replaced by ">".
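
A derived data set with ">" headers could be produced along these lines. This is only a minimal sketch, not bioconvert code: the function name `at_headers_to_fasta` is hypothetical, and it assumes FASTA-like input in which only header lines start with "@". It must not be applied to FASTQ files, where quality lines may also start with "@".

```python
def at_headers_to_fasta(text: str) -> str:
    """Replace a leading '@' on header lines with '>' (hypothetical helper)."""
    out = []
    for line in text.splitlines():
        # Assumption: in FASTA-like input, only header lines begin with '@'.
        if line.startswith("@"):
            line = ">" + line[1:]
        out.append(line)
    return "\n".join(out) + "\n"


# Small in-memory example; real test data would be read from a file.
example = "@seq1 test\nACGT\n@seq2\nGGCC\n"
converted = at_headers_to_fasta(example)
print(converted, end="")
```

Lines that do not start with "@" are passed through unchanged, so a file that already uses ">" headers is left as it is.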

blaiseli commented 6 years ago

I've been using fasta files since 2004 and have never heard of fasta headers beginning with "@". I can't find any mention of this on the internet. It seems that SeqIO.parse silently skips records that have this kind of header.
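
Since a parser may skip such records without raising an error, a quick pre-check of the header markers can catch the problem before benchmarking. The sketch below is pure Python and does not call biopython. The function name `count_header_markers` is hypothetical, and it assumes FASTA-like input (in FASTQ, quality lines may also start with "@").

```python
from collections import Counter


def count_header_markers(text: str) -> Counter:
    """Count lines starting with '>' or '@' (hypothetical helper)."""
    counts = Counter()
    for line in text.splitlines():
        # line[:1] is '' for empty lines, so they are skipped safely.
        if line[:1] in (">", "@"):
            counts[line[0]] += 1
    return counts


# Mixed example: one standard header and one '@' header.
mixed = ">a\nACGT\n@b\nTTTT\n"
counts = count_header_markers(mixed)
if counts["@"]:
    print(f"{counts['@']} record(s) use '@' instead of '>'")
```

Comparing the number of ">" headers with the number of records a parser returns is one way to notice silently skipped records.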

cokelaer commented 6 years ago

My mistake. This explains my wrong results regarding the biopython performance. The simulator is fixed, and I believe we now get the same benchmark performance. I've updated the documentation with a nice benchmark for fastq2fasta.