datacarpentry / wrangling-genomics

Data Wrangling and Processing for Genomics
https://datacarpentry.org/wrangling-genomics/

Remove FASTQ encoding variants; explain ASCII more #45

Closed peterjc closed 6 years ago

peterjc commented 7 years ago

The history of FASTQ encodings and the legacy Solexa/Illumina variants is a distraction, and in my opinion can be removed from the QC lesson:

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/_episodes/00-readQC.md

Furthermore, since the lesson is licensed under CC-BY, there is a licensing problem with lifting the CC-BY-SA table from https://en.wikipedia.org/wiki/FASTQ_format (and without even giving the source URL, which would be the minimum form of attribution).

I would instead reduce this section to focus on introducing ASCII and the idea that characters have an associated numerical code, and that this is used with an offset to store the quality score using one character per base.
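As a rough illustration of that idea, here is a minimal sketch. It assumes the common offset of 33 (the Sanger convention); the function names are made up for this example and are not taken from the lesson.

```python
# Assumption: offset 33, as used by Sanger-style FASTQ files.
PHRED33_OFFSET = 33


def decode_quality(quality_string, offset=PHRED33_OFFSET):
    """Turn a FASTQ quality string into one integer score per base."""
    # ord() gives the ASCII code of a character; subtracting the
    # offset recovers the quality score stored in that character.
    return [ord(ch) - offset for ch in quality_string]


def encode_quality(scores, offset=PHRED33_OFFSET):
    """Turn a list of integer scores back into a quality string."""
    # chr() is the inverse of ord(): ASCII code -> character.
    return "".join(chr(score + offset) for score in scores)


# '!' is ASCII 33, '5' is ASCII 53, 'I' is ASCII 73
print(decode_quality("!5I"))  # prints [0, 20, 40]
```

Each base therefore costs exactly one character in the quality line, which keeps the quality string the same length as the sequence.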

The different encodings can be reduced to a footnote or remark, with a reference to the Wikipedia page https://en.wikipedia.org/wiki/FASTQ_format or our paper https://doi.org/10.1093%2Fnar%2Fgkp1137

(And then move on to explain PHRED scores)
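For reference, the standard PHRED definition relates a score Q to the estimated probability P that a base call is wrong: Q = -10 * log10(P). A minimal sketch of the conversion in both directions (the function names are illustrative only):

```python
import math


def phred_to_error_prob(q):
    """Estimated probability that a base call with PHRED score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED score corresponding to error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A score of 20 means roughly a 1 in 100 chance of error,
# and a score of 30 roughly 1 in 1000.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```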

hoytpr commented 7 years ago

Speaking as a life scientist/coder wannabe: the legacy encodings and the variants are annoying, and you are correct to want to dispose of this section, but in my opinion, not quite yet.

There are probably thousands of life scientists who have legacy data sitting on their computers waiting for analysis. Much of life science proceeds slowly, with obvious exceptions. From experience, and based on the number of senior life science faculty in past workshops, many researchers still plan to analyze old data. They are often reluctant to hand it over for analysis, or may want to combine it with newer data.

Parenthetically, presenting the FASTQ variants may have the secondary benefit of encouraging (scaring) these researchers to share their data with an informatician.

I'm sympathetic to your expert opinion. You've been involved with these formats since before 2009, but in our workshops this can be the first exposure to the possibility of alternative legacy score encodings. The Wikipedia page would be confounding.

Recognizing your original work, and at the very least giving a source URL to Wikipedia, is appropriate.

I would instead reduce this section to focus on introducing ASCII and the idea that characters have an associated numerical code, and that this is used with an offset to store the quality score using one character per base.

This (quoted above) should be included to provide justification for the use of ASCII vs. numeric scoring.

peterjc commented 7 years ago

If you've found from personal experience that lots of the people likely to take this workshop still have legacy FASTQ data, then that does justify including the alternative encodings (at least for another year or so).
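For anyone who does have legacy files, a rough heuristic for guessing the offset is to look at the lowest ASCII code in the quality lines. The sketch below is an assumption-laden illustration, not a validated tool: the function name is hypothetical, the threshold of 59 comes from the lowest character used by the old Solexa +64 encoding, and it does not distinguish Solexa scores from PHRED scores. It can also guess wrong on a small sample that happens to contain only high-quality bases.

```python
def guess_offset(quality_strings):
    """Guess whether quality strings use an ASCII offset of 33 or 64.

    Heuristic only: characters with ASCII codes below 59 cannot occur
    in the legacy +64 encodings, so their presence implies offset 33.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters supplied")
    return 33 if min(codes) < 59 else 64


print(guess_offset(["!5I", "III5"]))  # prints 33
print(guess_offset(["hhh@", "ffeh"]))  # prints 64
```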

I'm glad you agree that a bit more about ASCII would be needed (I recently had to introduce this to a PhD student and, separately, to a postdoc; it's something many of the workshop attendees will not have seen before).

ErinBecker commented 6 years ago

Great discussion @peterjc and @hoytpr and thank you for coming to a resolution so quickly! I'm working on the lessons today and happy to put in a PR to address this issue, but want to also give you the opportunity @peterjc to do so if you would like. Please let me know if you'd like me to leave this open for you - else I'll put in a PR by end of day today.

peterjc commented 6 years ago

It's already evening my time, so @ErinBecker feel free to go ahead with a PR - and I'll try to review it tomorrow if that's wanted.

ErinBecker commented 6 years ago

Sorry not to have updated this issue before. A PR related to this was put in and accepted a few weeks ago (https://github.com/datacarpentry/wrangling-genomics/pull/69). I'm not sure if it does what you were envisioning @peterjc, but it at least solves the problem of lifting the table from Wikipedia. I think it also helps to simplify the issue by explicitly stating that different sequencing platforms have different quality encoding systems, without going into the details of what all of those different systems are. Please let me know if you think this should be closed or if you'd like to see another change to this material (after the release tomorrow).

peterjc commented 6 years ago

I think #69 is a big improvement, but have submitted #97 with some minor polish.

Right now the text does not explain what ASCII is or the low-level numerics of mapping a letter into a quality score, but I'm OK with that. It is a distraction which can be glossed over given what else needs to be covered in a limited time.

peterjc commented 6 years ago

#97 was merged. I'm happy to close this issue now. Thanks all.