Open mbrush opened 5 years ago
After considering the above proposal, I am wary of including the ref allele as a qualifier in the primary statement, because it is not intended to refine the meaning of what is stated as true. Elements of the statement object should only be things that impact statement semantics. The ref allele, by contrast, is just additional information that is 'good to know'.
Another option is to create a new structure to house any simple metadata about the annotated variation, that can be presented alongside the primary statement rather than within it (e.g. a 'variant-level metadata' object). This may include things like:
These bits of info are not really qualifying parts of the statement, but they are also IMO not something worth the overhead of creating separate VA statement objects to represent. And we previously decided, for good reason, that this basic info should not be part of the VR model - because it is not foundational identifying information. A 'variant level metadata' structure could capture such information in a concise way outside the statement structure.
Taking this idea a step further, we could even extend the variant-level metadata object to allow concise packaging of information that does have a VA statement representation, which we had previously shown being captured in 'supporting statements' for an annotation, e.g. including minimal info about functional impact, population frequency, and affected features in a molecular consequence annotation. This would be used when a data creator wants to simply attach a bag of useful facts to accompany the primary annotation of interest, without the overhead of creating/including full statements for them. This takes us away from our uber-normalized and modular approach - but it may be worth considering for message structures to provide more concise messages when needed. I'm not sure I'd advocate for this approach, but wanted to document it here to keep in our back pocket.
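To make the 'alongside rather than within' idea concrete, here is a minimal Python sketch of a message that carries a primary statement plus an optional variant-level metadata bag. Everything in it is an assumption for illustration: the `AnnotationMessage` class, the `variantLevelMetadata` and `referenceAllele` keys, and the placeholder `allele-1` identifier are hypothetical and not part of any agreed VA or VR schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class AnnotationMessage:
    """Hypothetical envelope: a primary statement plus side-car metadata."""
    # Only elements that affect statement semantics belong here.
    statement: Dict[str, Any]
    # 'Good to know' facts about the variant; never part of the statement.
    variant_level_metadata: Dict[str, Any] = field(default_factory=dict)

    def to_message(self) -> Dict[str, Any]:
        message: Dict[str, Any] = {"statement": self.statement}
        if self.variant_level_metadata:  # optional: omitted when empty
            message["variantLevelMetadata"] = dict(self.variant_level_metadata)
        return message


# Placeholder statement content; keys and values are illustrative only.
msg = AnnotationMessage(
    statement={"type": "PopulationFrequency", "subject": "allele-1"},
    variant_level_metadata={"referenceAllele": "T"},
)
print(msg.to_message())
```

In this shape the ref allele travels with the annotation but cannot be mistaken for a qualifier, and a producer that has no such facts simply emits the statement alone.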
The VR metadata idea is worth exploring further with other use cases outside of PopFreq, as I’m confident there will be other types of derivative attributes that may be helpful in various message types. I’m not sure this metadata or derivative data should be required as a core data point; it seems better suited to optional use, depending on how implementations choose to implement a given contract.
Created new ticket to explore the idea of a generally useful Variant Level Metadata object - see #46.
We should consider whether we want to capture as part of VA statements the nature of the 'ref' state when the subject allele is considered 'alt'. This originally came up in the context of Population Frequency annotations, so I will draw examples from this VA type, but it is a consideration for all VA types.
This is of course not relevant in cases where there is no indication of the ref vs alt status of the subject allele - e.g. for Ensembl PopFreq annotations such as these. But most data sources such as CellBase and gnomAD/ExAC indicate the ref->alt change when they annotate an alt allele - either in the HGVS label they give the allele (e.g. here), or as part of the structured data itself (e.g. here).
For good reason, we had previously decided on VR and VA calls that ref vs alt status of an allele should not be captured as part of the foundational variant representation itself. But in the context of an annotation, this status may be important information to convey, and if a source explicitly captures this info we should consider if/how we might do the same. A simple approach would be to allow for an optional 'referenceAlleleQualifier' attribute on the PopFreq statement, to capture the nature of the 'ref' state when the subject allele is considered 'alt'. e.g. for the CellBase example above:
The implication here is that "T" is the reference state for the subject "C" allele, on the NC_000019.9 reference used to define it.
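As a rough sketch of the shape this could take, the Python below models a PopFreq statement with an optional `referenceAlleleQualifier`. Only that attribute name and the values "C", "T", and NC_000019.9 come from the discussion here; the class name and the other field names are hypothetical, and position and frequency fields are deliberately left out rather than guessed.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PopFreqStatement:
    """Hypothetical, simplified PopFreq statement (not the actual VA schema)."""
    reference_sequence: str    # sequence the subject allele is defined on
    subject_allele_state: str  # state of the annotated (alt) allele
    # Ref state for the subject allele, set only if the source provides it.
    referenceAlleleQualifier: Optional[str] = None

    def to_message(self) -> dict:
        """Serialize, dropping the qualifier when no ref state was given."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# Source that reports the ref->alt change: "T" is ref for the "C" allele.
with_ref = PopFreqStatement("NC_000019.9", "C", referenceAlleleQualifier="T")

# Source with no ref/alt indication: the qualifier is simply absent.
without_ref = PopFreqStatement("NC_000019.9", "C")

print(with_ref.to_message())
print(without_ref.to_message())
```

Keeping the attribute optional means sources that do not distinguish ref from alt need no special handling, while sources that do can pass the ref context through.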
Without something like this, the ref context that many sources provide (gnomAD, CellBase, etc.) is lost. Some may argue this is fine, but others may feel differently.