IQTLabs / hypothesis-bio

Hypothesis extension for computational biology
https://lab41.github.io/hypothesis-bio
Apache License 2.0

Add separate options for generating Illumina, Sanger, and Solexa FASTQ strings #37

Closed (vaastav closed this issue 4 years ago)

vaastav commented 4 years ago

The FASTQ format description on Wikipedia (https://en.wikipedia.org/wiki/FASTQ_format) says that quality scores are encoded as ASCII characters 33 to 126. But according to @mbhall88 in commit e91a86c4f191c8c1130f30938b50dd6629fb5db, Illumina quality scores use characters 64 to 126 instead.

As @mbhall88 pointed out, there are three different quality string formats, and we should support all of them: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/
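
To make the difference between the three variants concrete, here is a minimal sketch of how the same quality string decodes under each one. The ranges are the ones given in the linked article (Sanger: offset 33, ASCII 33 to 126; Solexa: offset 64, ASCII 59 to 126; Illumina 1.3+: offset 64, ASCII 64 to 126). The names `ENCODINGS` and `decode_quality` are made up for illustration and are not part of hypothesis-bio.

```python
# Quality-string variants described in the FASTQ literature.
# name: (ASCII offset, lowest ASCII code, highest ASCII code)
ENCODINGS = {
    "sanger": (33, 33, 126),    # Phred scores 0..93
    "solexa": (64, 59, 126),    # Solexa scores -5..62
    "illumina": (64, 64, 126),  # Phred scores 0..62 (Illumina 1.3+)
}


def decode_quality(quality, encoding):
    """Convert a quality string to a list of integer scores.

    Hypothetical helper for illustration only.
    """
    offset, low, high = ENCODINGS[encoding]
    scores = []
    for char in quality:
        code = ord(char)
        if not low <= code <= high:
            raise ValueError(f"{char!r} is outside the {encoding} range")
        scores.append(code - offset)
    return scores
```

A character such as `!` (ASCII 33) is valid Sanger output but is rejected under the other two variants, which is exactly the kind of mismatch a parser can silently get wrong.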

I think we should instead have a separate option for generating Illumina sequences, as opposed to generic FASTQ sequences. The reason is that sequences from the Illumina software use a systematic identifier rather than the random one we are generating.

vaastav commented 4 years ago

@Benjamin-Lee, @mbhall88 what do you guys reckon is the right way forward?

I think we should be supporting all of these formats.

mbhall88 commented 4 years ago

We do support all of these formats, i.e. users can select the range they want if they don't like the default.

vaastav commented 4 years ago

Hmm... It wasn't obvious to me that we supported all of those formats. Maybe we need to mention it explicitly in the documentation?

mbhall88 commented 4 years ago

Hmm, the more I think about this, the more I realise I have been thinking about it in slightly the wrong way. In normal libraries you want your defaults to be in line with what most users will want. But for us, we want them to be as "non-normal" as possible (within the constraints of the file type specifications), as we are trying to identify where people aren't handling certain cases. If they then want to restrict to only "normal" use cases, they can. It seems it has taken me a while to come to this realisation, sorry.

mbhall88 commented 4 years ago

I will do some work on changing this now and put in a PR

mbhall88 commented 4 years ago

OK, it seems like this is the first "formal" specification for the FASTQ format. I will ensure the defaults cover all possibilities within it.

mbhall88 commented 4 years ago

Question regarding line wrapping for FASTQ:
It is not "against" the spec to have the sequence and quality lines wrapped. However, the formal spec strongly recommends against it, as it makes parsing extremely difficult.

It is vital to note that the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string—including at the start of any of the quality lines. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence.

Because of this complication, most tools output FASTQ files without line wrapping of the sequence and quality string. This means each read consists of exactly four lines (sometimes very long lines), ideal for a very simple parser to deal with. The OBF tools follow this convention on output, as does the MAQ conversion script. We recommend this for maximum compatibility with (simplistic) parsers.

What do we think we should use by default?
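
For reference, the length check the spec describes can be sketched as below. This is a minimal illustration, not code from hypothesis-bio: `parse_fastq` is a hypothetical name, and it assumes every read has a non-empty sequence and that no blank lines separate records.

```python
def parse_fastq(text):
    """Parse FASTQ text whose sequence and quality lines may be wrapped.

    Returns a list of (header, sequence, quality) tuples.
    Hypothetical helper; assumes non-empty sequences.
    """
    lines = text.splitlines()
    records = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith("@"):
            raise ValueError(f"expected '@' header on line {i + 1}")
        header = lines[i][1:]
        i += 1
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        while i < len(lines) and not lines[i].startswith("+"):
            seq_parts.append(lines[i])
            i += 1
        if i == len(lines):
            raise ValueError("missing '+' separator")
        sequence = "".join(seq_parts)
        i += 1  # skip the '+' line
        # Quality lines run until we have as many characters as the
        # sequence; a leading '@' here is data, not a new record.
        quality = ""
        while i < len(lines) and len(quality) < len(sequence):
            quality += lines[i]
            i += 1
        if len(quality) != len(sequence):
            raise ValueError("quality length does not match sequence length")
        records.append((header, sequence, quality))
    return records
```

A parser that instead treats every line starting with `@` as a new header would split the first record of an input like `"@r1\nACGT\nAC\n+\n@III\nI@\n"` in the wrong place.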

Benjamin-Lee commented 4 years ago

Wrap wrap wrap

Benjamin-Lee commented 4 years ago

If it's not against the spec, then Hypothesis should generate it to find more edge cases
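
As a rough sketch of what generating wrapped records could look like, the function below formats one record with its sequence and quality lines broken at a given width. The name `wrap_fastq_record` and its signature are assumptions for illustration, not the hypothesis-bio API; in a Hypothesis strategy, `width` would presumably be drawn from an integer strategy rather than passed in by hand.

```python
def wrap_fastq_record(header, sequence, quality, width):
    """Format one FASTQ record, wrapping sequence and quality at `width`.

    Hypothetical helper for illustration only.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    if width < 1:
        raise ValueError("width must be at least 1")

    def chunks(s):
        # Keep one (empty) line for an empty string so the record
        # still has a sequence line and a quality line.
        return [s[i:i + width] for i in range(0, len(s), width)] or [""]

    lines = ["@" + header, *chunks(sequence), "+", *chunks(quality)]
    return "\n".join(lines) + "\n"
```

With a small width and a quality string containing `@`, this readily produces quality lines that begin with `@`, which is the edge case the spec warns about.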

mbhall88 commented 4 years ago

OK. I'm going to create a suite of functions to generate Illumina, PacBio, and Nanopore sequence ID/header lines.
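
As a rough idea of what the Illumina case might look like, here is a minimal sketch that builds an identifier line in the Casava 1.8 layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`). Everything here is an assumption for illustration: the function name, the field lengths, and the numeric ranges are made up, and it uses the standard `random` module rather than Hypothesis strategies. PacBio and Nanopore headers are not covered.

```python
import random
import re
import string


def illumina_header(rng):
    """Build one Casava 1.8-style Illumina sequence identifier line.

    Hypothetical helper; field lengths and ranges are illustrative guesses.
    """
    alnum = string.ascii_uppercase + string.digits
    instrument = "".join(rng.choice(alnum) for _ in range(6))
    run_id = rng.randint(1, 999)
    flowcell = "".join(rng.choice(alnum) for _ in range(9))
    lane = rng.randint(1, 8)
    tile = rng.randint(1, 9999)
    x_pos = rng.randint(0, 99999)
    y_pos = rng.randint(0, 99999)
    read = rng.choice([1, 2])          # member of a pair
    filtered = rng.choice("YN")        # Y if the read failed filtering
    control = 2 * rng.randint(0, 50)   # 0 or an even number
    index = "".join(rng.choice("ACGT") for _ in range(6))
    return (
        f"@{instrument}:{run_id}:{flowcell}:{lane}:{tile}:{x_pos}:{y_pos} "
        f"{read}:{filtered}:{control}:{index}"
    )


# Pattern for checking the overall layout of the generated header.
ILLUMINA_HEADER = re.compile(
    r"^@[A-Z0-9]+:\d+:[A-Z0-9]+:\d+:\d+:\d+:\d+ [12]:[YN]:\d+:[ACGT]+$"
)
```

Passing a seeded `random.Random` instance, e.g. `illumina_header(random.Random(0))`, keeps the output reproducible.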