biopython / biopython

Official git repository for Biopython (originally converted from CVS)
http://biopython.org/

Strict four line FASTQ parser #1812

Open peterjc opened 5 years ago

peterjc commented 5 years ago

As of Biopython 1.71, Bio.SeqIO now supports reading and writing two-line-per-record FASTA files under the format name "fasta-2line", useful if you wish to work without line-wrapped sequences.

Related to this, I was thinking about adding a similar FASTQ parser for Sanger-encoded files which use exactly four lines per record. These make up the vast majority of FASTQ files in use now, as we hoped when writing http://dx.doi.org/10.1093/nar/gkp1137 - line wrapping is almost never used, and current sequencers do not use the legacy Solexa/Illumina-specific encodings.

In terms of naming, the SeqIO interface format names are defined at https://github.com/biopython/biopython/blob/biopython-172/Bio/SeqIO/__init__.py#L419 as:

                     "fastq": QualityIO.FastqPhredIterator,
                     "fastq-sanger": QualityIO.FastqPhredIterator,
                     "fastq-solexa": QualityIO.FastqSolexaIterator,
                     "fastq-illumina": QualityIO.FastqIlluminaIterator,

Currently these all use the function FastqGeneralIterator internally, which supports line-wrapped records. The goal here is to have an alternative, faster low-level parser which only expects 4-line FASTQ records, e.g. FastqFourLineIterator, to match the existing naming convention in this file (which is not PEP8 compliant).
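To make the idea concrete, here is a minimal sketch of what a strict four-line reader could look like. It uses only the standard library; the function name `four_line_fastq` and its error messages are hypothetical, and this is not the Biopython implementation, just an illustration of reading exactly four lines per record and rejecting anything else.

```python
from io import StringIO


def four_line_fastq(handle):
    """Yield (title, sequence, quality) tuples from strict 4-line FASTQ.

    Hypothetical helper for illustration only; not Biopython's code.
    """
    while True:
        title_line = handle.readline()
        if not title_line:
            return  # clean end of file
        seq_line = handle.readline()
        plus_line = handle.readline()
        qual_line = handle.readline()
        if not qual_line:
            raise ValueError("Truncated FASTQ record (fewer than four lines)")
        if not title_line.startswith("@"):
            raise ValueError("FASTQ record should start with '@'")
        if not plus_line.startswith("+"):
            # A line-wrapped record ends up here, since its third line
            # is more sequence rather than the '+' separator.
            raise ValueError("Third line of a FASTQ record should start with '+'")
        seq = seq_line.rstrip("\r\n")
        qual = qual_line.rstrip("\r\n")
        if len(seq) != len(qual):
            raise ValueError("Sequence and quality lengths differ")
        yield title_line[1:].rstrip("\r\n"), seq, qual


# Usage with an in-memory handle (made-up example data):
example = StringIO("@r1 desc\nACGT\n+\nII!I\n@r2\nGG\n+r2\n#5\n")
for title, seq, qual in four_line_fastq(example):
    print(title, seq, qual)
```

Because the record boundaries come from counting lines, a quality line that happens to start with "@" needs no special handling, which is where most of the complexity in a line-wrap-tolerant parser comes from.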

We might expose this in Bio.SeqIO as format name "fastq-4line" (or "fastq-sanger4line"?), and I would consider making "fastq" use this too - or indeed just making "fastq" mean the four-line variant?

Given his good work on #1805, @chris-rands might want to tackle this?

chris-rands commented 5 years ago

Hi Peter, I can't work on this right away, but assuming there is no rush, I'd be happy to help so you can assign it to me :thumbsup:

peterjc commented 5 years ago

That'd be great - thank you. This would be nice to have, but does not seem urgent.

DevangThakkar commented 5 years ago

@chris-rands @peterjc Do you mind if I try working on this?

chris-rands commented 5 years ago

@DevangThakkar I can't speak for Peter, but from my perspective that would be good, thanks. Things have got busy for me, so I haven't actually started working on this yet. I can help review a pull request if that is appropriate.

peterjc commented 5 years ago

Sure - @DevangThakkar please give this a go. Are you clear on the goal here, and will you be able to do some timings (see Chris' work on the FASTA side for some ideas there)?
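As a rough idea of how such timings might be set up, here is a small sketch using only the standard library. Everything in it is an assumption for illustration: `make_fastq` builds synthetic data, `line_group_parser` is a stand-in for whichever parser is being measured, and `best_time` is a hypothetical helper. In practice you would pass in the existing and the new parser and compare the numbers on the same input.

```python
import timeit
from io import StringIO


def make_fastq(n_records, read_length=100):
    """Build a synthetic four-line FASTQ string (made-up benchmark data)."""
    seq = "ACGT" * (read_length // 4)
    qual = "I" * len(seq)
    return "".join(f"@read{i}\n{seq}\n+\n{qual}\n" for i in range(n_records))


def line_group_parser(handle):
    """Stand-in parser (hypothetical): yields raw groups of four lines."""
    return zip(handle, handle, handle, handle)


def best_time(parser, data, repeat=3):
    """Return the fastest of `repeat` runs, fully consuming the parser each time."""
    def run():
        # Iterate to the end so lazy parsers actually do their work.
        return sum(1 for _ in parser(StringIO(data)))
    return min(timeit.repeat(run, number=1, repeat=repeat))


data = make_fastq(10000)
print(f"stand-in parser: {best_time(line_group_parser, data):.4f} s")
```

Taking the minimum over several repeats reduces noise from other processes, and parsing from an in-memory string keeps disk speed out of the comparison.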

DevangThakkar commented 5 years ago

@peterjc Yeah, I think I am. I've opened a pull request with the new parser code and the timings comparing the existing and the new parser similar to the one @chris-rands had compiled. Let me know what you think of it, and do let me know if we can make it even faster somehow.