fairgenomes / fairgenomes-semantic-model

FAIR Genomes semantic metadata model. The core is a YAML file, which is transformed into all other desired output formats.

Use median instead of average values if possible #119

Open mmokrejs opened 2 years ago

mmokrejs commented 2 years ago

Hi, I came across your project by chance and I wonder why you keep track of Average read depth and of Observed insert size. The former would be better replaced with Median read depth, and the latter should probably be called Outer mate median distance instead, depending on the tool used to analyze the data. The model seems too Illumina-oriented. How will this work for PacBio and Oxford Nanopore sequencing projects? And for older Roche/454 and IonTorrent-based projects, which used totally different library prep protocols (RF vs. FR read orientations, etc.)? Likewise, Sanger-based genome sequencing?
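
A minimal sketch of the average-versus-median point, using made-up per-position read depths (the values and the `depths` name are assumptions for illustration, not data from the project). A single extreme position, such as a collapsed repeat, pulls the average far away from the typical depth while the median barely moves:

```python
from statistics import mean, median

# Hypothetical per-position read depths; the last value mimics an extreme
# outlier such as a collapsed repeat region.
depths = [28, 30, 31, 29, 30, 32, 30, 2000]

average_depth = mean(depths)
median_depth = median(depths)

print(f"average read depth: {average_depth:.2f}")  # inflated by the outlier
print(f"median read depth:  {median_depth}")       # stays near the typical depth
```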

joerivandervelde commented 2 years ago

Hello and thanks so much for your feedback! You are right, Median read depth is a more robust quality metric than Average read depth because the latter may be inflated by extreme outliers. The model will be updated soon. Indeed, there is an Illumina bias in the model because they are currently the predominant vendor, at least in The Netherlands. Your help in resolving this is most appreciated. So is Outer mate median distance a more generic term than Observed insert size (i.e. can it be used for the same situations and more, rather than for different ones)? If so, we could replace the term. If not, we should probably introduce a new ontology term. Could you perhaps provide a definition for the Outer mate median distance term, similar to the one that we have for Observed insert size? That definition is:

In paired-end sequencing, the DNA between the adapter sequences is the insert. The length of this sequence is known as the insert size, not to be confused with the inner distance between reads. So, fragment length equals read adapter length (2x) plus insert size, and insert size equals read length (2x) plus inner distance.
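
The arithmetic in this definition can be sketched as follows. The function names and the example numbers are hypothetical and only restate the two equalities given above; the inner distance becomes negative when the two reads overlap.

```python
def insert_size(read_length: int, inner_distance: int) -> int:
    """Insert size = 2 x read length + inner distance (per the definition above)."""
    return 2 * read_length + inner_distance


def fragment_length(adapter_length: int, insert: int) -> int:
    """Fragment length = 2 x adapter length + insert size (per the definition above)."""
    return 2 * adapter_length + insert


# Hypothetical library: 150 bp reads, 50 bp between the reads, 60 bp adapters.
ins = insert_size(read_length=150, inner_distance=50)
frag = fragment_length(adapter_length=60, insert=ins)
print(ins, frag)

# Overlapping reads: the inner distance is negative, so the insert is
# shorter than the two reads laid end to end.
overlapping = insert_size(read_length=150, inner_distance=-50)
print(overlapping)
```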

mmokrejs commented 2 years ago

Hi @joerivandervelde , I am sorry things are stacking up in my mailbox ...

OK, so you fixed the first part already, the read depth-related calculation.

https://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html
https://www.biostars.org/p/411012/
https://www.researchgate.net/publication/260170597_Assessment_of_Insert_Sizes_and_Adapter_Content_in_Fastq_Data_from_NexteraXT_Libraries/figures?lo=1

The field you keep in the ontology should describe how the sequencing library was prepared and how long the DNA fragments were, on average or, better, as a median. Unfortunately, people tend to distinguish between fragment size and insert size, depending on whether adapters have already been added or not.

Practically, different software tools calculate either the outer or the inner distance. I assume the goal of the catalogue is either to collect one of the two values or to gently push users to recalculate a single, intended value using the right software.

See:
https://broadinstitute.github.io/picard/picard-metric-definitions.html#InsertSizeMetrics
https://gatk.broadinstitute.org/hc/en-us/articles/360037225252-CollectInsertSizeMetrics-Picard-

In other words, this annotation term is supposedly about the SAM TLEN field (observed template length), which shows up in column 9 of SAM-formatted output.
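
As a rough sketch of what a median over column 9 looks like, the snippet below parses a few hand-written SAM records in plain Python. The records, the `template_lengths` helper, and the filtering choices (count each properly paired template once via its first read, ignore TLEN of 0) are assumptions for illustration; they are not claimed to reproduce what Picard or any other tool does.

```python
from statistics import median

PROPER_PAIR = 0x2     # SAM FLAG bit: each segment properly aligned
FIRST_IN_PAIR = 0x40  # SAM FLAG bit: first segment in the template


def template_lengths(sam_lines):
    """Collect |TLEN| (column 9) once per properly paired template."""
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        tlen = int(fields[8])     # column 9, zero-based index 8
        if flag & PROPER_PAIR and flag & FIRST_IN_PAIR and tlen != 0:
            lengths.append(abs(tlen))
    return lengths


# Hypothetical records; SEQ and QUAL are "*" to keep the example short.
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t100M\t=\t300\t300\t*\t*",
    "r1\t147\tchr1\t300\t60\t100M\t=\t100\t-300\t*\t*",
    "r2\t99\tchr1\t500\t60\t100M\t=\t750\t350\t*\t*",
    "r2\t147\tchr1\t750\t60\t100M\t=\t500\t-350\t*\t*",
    "r3\t99\tchr1\t900\t60\t100M\t=\t1220\t420\t*\t*",
    "r3\t147\tchr1\t1220\t60\t100M\t=\t900\t-420\t*\t*",
    "r4\t65\tchr1\t2000\t60\t100M\t=\t6900\t5000\t*\t*",  # not a proper pair
]

lengths = template_lengths(sam)
print(lengths, median(lengths))
```

Note that TLEN spans from the leftmost to the rightmost mapped base of the template, so it corresponds to an outer distance rather than the inner distance between the reads.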

While at it, you probably also want to add terms for https://broadinstitute.github.io/picard/picard-metric-definitions.html#JumpingLibraryMetrics .

joerivandervelde commented 2 years ago

Hello @mmokrejs, we've been having internal discussions on how to tackle this but haven't quite sorted it out. Could you perhaps clarify the change you are proposing? If metrics are not generic across all sequencing platforms, we might also consider modeling it something like this; might that make sense?