Provenance for sequencing data in PopFreq Annotations

The evidence and provenance competency questions (CQs) here (rows 16-20 specifically) highlight the need to capture metadata on sequencing data quality, technology, and other aspects of variant detection that are relevant as provenance in population frequency studies. This ticket will be used to elicit input and feedback on this topic.

Our short-term goal is to define a simpler model for a v0 release in the near future. Longer term, we will want to coordinate with other GA4GH efforts, and with drivers/stakeholders/partners like HL7-CG, to develop a richer, shared model of sequencing studies/metadata that covers broad use cases and can be submitted to schema blocks and re-used by the broader community.

We need folks versed in this area to make recommendations about what is really needed and how we might structure it. Specifically, we need input on the types of technologies and protocols applied in population sequencing efforts, and on the specific metrics/scores used to assess sequencing data quality/reliability. Within our VA group, Irina is knowledgeable here, and Steven Hart has added some insightful comments as well. I also suspect that the HL7 Clinical Genomics group has thought about this issue, so Bob and/or one of his colleagues could advise as well. Please note if there are others we might reach out to.

Starting a list of requirements here based on the CQs and data examples, but fully aware this is naïve and incomplete. Hoping folks can help round this out (feel free to edit this list or comment directly).

Experimental provenance:

the sequencing method and technology/platform that generated the sequence data
the molecule type/level of the sequenced material (RNA, exome, genome)
the evaluant from which sequenced nucleic acids were taken (e.g. cheek swab, WBC, etc.)
whether the variant was directly observed or imputed in the population (should this be a separate/new type of statement, or perhaps a qualifier on the statement?)
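
To make the list above easier to react to, here is a minimal Python sketch of how these four items might be grouped for a v0 model. All names here (`ExperimentalProvenance`, `DetectionMode`, the field names, and the example values) are hypothetical placeholders, not part of any agreed VA or gnomAD schema, and it assumes the observed/imputed distinction is modeled as a qualifier rather than a separate statement type.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DetectionMode(Enum):
    """Hypothetical qualifier: was the variant directly observed or imputed?"""
    OBSERVED = "observed"
    IMPUTED = "imputed"


@dataclass(frozen=True)
class ExperimentalProvenance:
    """Study-level experimental provenance (illustrative field names only)."""
    sequencing_method: str            # method that generated the sequence data
    material_type: str                # "RNA", "exome", or "genome"
    platform: Optional[str] = None    # technology/platform, if reported
    evaluant: Optional[str] = None    # e.g. "cheek swab", "WBC"
    detection_mode: DetectionMode = DetectionMode.OBSERVED


# Example values are made up for illustration.
prov = ExperimentalProvenance(
    sequencing_method="short-read sequencing",
    material_type="genome",
    evaluant="WBC",
    detection_mode=DetectionMode.IMPUTED,
)
print(prov)
```

Free-text strings are used for method, material, and evaluant only to keep the sketch short; controlled vocabularies would likely be a better fit once the requirements settle.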

Quality metrics/flags:

gnomAD Genotype Quality Metrics (link)
   a. genotype quality
   b. depth (median or average across genome for carriers vs non carriers? histogram by depth bucket?) (is depth same as coverage?)
   c. allele balance
gnomAD Site Quality Metrics (link)
   a. 13 different scores/metrics here - which are important/relevant?
What is the gnomAD 'Popmax Filtering AF'?
A 'minimum frequency threshold' was mentioned on a call.
"Flags for assessing whether the frequency was potentially affected by technical artifacts, such as low counts or a low-complexity genomic region"?

A comment from SNH suggested that there are quality scores specific to different sequencing technologies and platforms.
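
As a strawman for discussion, the sketch below shows one way per-carrier genotype metrics could be rolled up into study-level statistics plus artifact flags. It assumes allele balance means alt-supporting reads divided by total reads at the site. The class names, flag strings, and the `min_carriers` threshold of 5 are all made up for illustration and do not reflect gnomAD's actual definitions or cutoffs.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Sequence


@dataclass(frozen=True)
class CarrierGenotype:
    """Per-individual genotype metrics for one variant (hypothetical shape)."""
    genotype_quality: int
    depth: int        # total reads covering the site
    alt_reads: int    # reads supporting the alternate allele


@dataclass(frozen=True)
class QualitySummary:
    """Statistics calculated across all carriers in the population study."""
    median_genotype_quality: float
    median_depth: float
    median_allele_balance: float
    flags: List[str]


def summarize_quality(
    carriers: Sequence[CarrierGenotype],
    in_low_complexity_region: bool = False,
    min_carriers: int = 5,  # hypothetical "low count" threshold
) -> QualitySummary:
    if not carriers:
        raise ValueError("at least one carrier genotype is required")
    if any(c.depth <= 0 for c in carriers):
        raise ValueError("depth must be positive to compute allele balance")

    flags: List[str] = []
    if len(carriers) < min_carriers:
        flags.append("low_count")
    if in_low_complexity_region:
        flags.append("low_complexity_region")

    return QualitySummary(
        median_genotype_quality=median(c.genotype_quality for c in carriers),
        median_depth=median(c.depth for c in carriers),
        median_allele_balance=median(c.alt_reads / c.depth for c in carriers),
        flags=flags,
    )


summary = summarize_quality(
    [
        CarrierGenotype(genotype_quality=99, depth=30, alt_reads=15),
        CarrierGenotype(genotype_quality=60, depth=20, alt_reads=8),
        CarrierGenotype(genotype_quality=80, depth=40, alt_reads=22),
    ]
)
print(summary)
```

Medians are used here only as one option; whether to exchange medians, means, or depth-bucket histograms is exactly the open question in item b above.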

Questions/Considerations:

Keep in mind that the metadata we are interested in here applies at the level of the reported population study (not a sequence run for an individual). So it should reflect metadata relevant to sequencing all population members (e.g. parameters/metrics/tools applied in sequencing of all individuals, or statistics calculated across all individuals).
At what level of detail should we capture/structure such metadata?
   a. As noted, we may have different short vs long term requirements here.
   b. Keep in mind our primary use case as a data exchange standard. We should only exchange data that may be directly used for search/faceting/filtering in APIs and data operations. For complete metadata, it may be best to simply point to an external reference - e.g. to gnomAD directly.
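
To illustrate the "only exchange what is filterable" idea, here is a rough Python sketch in which each record carries a few structured fields for search/faceting and a single reference to the complete metadata. The record shape, field names, identifiers, and the `example.org` URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class PopFreqRecord:
    """Minimal exchange record: filterable fields plus a pointer elsewhere."""
    variant_id: str
    allele_frequency: float
    material_type: str             # facet, e.g. "exome" or "genome"
    flags: Tuple[str, ...] = ()    # artifact flags, empty if none
    full_metadata_ref: str = ""    # external reference for everything else


def filter_records(
    records: Iterable[PopFreqRecord],
    material_type: Optional[str] = None,
    exclude_flagged: bool = False,
) -> List[PopFreqRecord]:
    """Keep records matching the requested facet and flag policy."""
    selected = []
    for record in records:
        if material_type is not None and record.material_type != material_type:
            continue
        if exclude_flagged and record.flags:
            continue
        selected.append(record)
    return selected


# Placeholder identifiers and URLs, for illustration only.
records = [
    PopFreqRecord("var-1", 0.012, "genome", (), "https://example.org/meta/var-1"),
    PopFreqRecord("var-2", 0.0004, "genome", ("low_count",), "https://example.org/meta/var-2"),
    PopFreqRecord("var-3", 0.031, "exome", (), "https://example.org/meta/var-3"),
]

clean_genomes = filter_records(records, material_type="genome", exclude_flagged=True)
print([r.variant_id for r in clean_genomes])
```

Anything not needed for a filter like this would stay behind the reference rather than being modeled in the exchange format.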

ga4gh / va-spec #43