Constraining annotation subject to transcript level variants for Molecular Consequence

mbrush commented 5 years ago

Should we require a 'transcript-level' variant as the subject for this annotation type (and perhaps other VA types)?

Some data sources will enforce this and provide only transcript-level variants (e.g. ClinGen). Others will not enforce this (e.g. GeL/Cellbase) and provide genome-level variants - but these should always be qualified/contextualized by a specific transcript where the reported consequence occurs.

Consider implementation and modeling challenges if we do vs do not enforce this constraint in our model.

If this is not constrained at the point of data creation (i.e. data creators can use genomic, transcript, or protein level variants as subjects of Molecular Consequence), then normalization needs to be done downstream, through either:

post-processing of GA4GH-compliant data to convert genomic representations (plus transcript qualifiers) to transcript representations
expansion of queries for a transcript variant to also query for all genomic variants that it can derive from, and all protein variants it can derive into

Ideally we would create/provide a service that automates any necessary post-processing, through transformation/normalization of the data or expansion of the query.
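To make the two downstream options concrete, here is a minimal sketch, not a proposed implementation. The `Annotation` class, the lookup tables, and the identifiers (`g.var1`, `tx1`, `c.var1`, `p.var1`) are all hypothetical placeholders; a real service would derive the projections from a variant mapping tool rather than a hard-coded dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified Molecular Consequence annotation."""
    subject: str                      # variant identifier (placeholder string)
    subject_level: str                # "genomic", "transcript", or "protein"
    consequence: str
    transcript_qualifier: Optional[str] = None


# Assumed lookup tables standing in for a real projection service.
# (genomic variant, transcript) -> transcript-level variant
GENOMIC_TO_TRANSCRIPT = {("g.var1", "tx1"): "c.var1"}
# transcript-level variant -> protein-level variant
TRANSCRIPT_TO_PROTEIN = {"c.var1": "p.var1"}


def normalize_to_transcript(ann: Annotation) -> Annotation:
    """Option 1: post-process data into a transcript-level representation."""
    if ann.subject_level == "genomic" and ann.transcript_qualifier:
        key = (ann.subject, ann.transcript_qualifier)
        if key in GENOMIC_TO_TRANSCRIPT:
            # The transcript is now implied by the subject, so the qualifier is dropped.
            return Annotation(GENOMIC_TO_TRANSCRIPT[key], "transcript", ann.consequence)
    # Already transcript-level, or no known projection: return unchanged.
    return ann


def expand_query(transcript_variant: str) -> set:
    """Option 2: expand a transcript-variant query to related variants."""
    genomic = {g for (g, _tx), c in GENOMIC_TO_TRANSCRIPT.items() if c == transcript_variant}
    protein = set()
    if transcript_variant in TRANSCRIPT_TO_PROTEIN:
        protein.add(TRANSCRIPT_TO_PROTEIN[transcript_variant])
    return {transcript_variant} | genomic | protein
```

Either function could sit behind the same service: the first rewrites the data once, the second leaves the data alone and pays the cost at query time.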

Additionally, NOT constraining variant type to the transcript level creates a modeling challenge in that a transcript qualifier on the primary statement becomes necessary in cases where a genome-level variant is used as the subject.
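As a rough illustration of that modeling consequence, the sketch below shows the kind of conditional rule an unconstrained model would need. The class and field names are hypothetical and are not taken from the VA spec.

```python
from dataclasses import dataclass
from typing import List, Optional

ALLOWED_LEVELS = {"genomic", "transcript", "protein"}


@dataclass(frozen=True)
class MolecularConsequenceStatement:
    """Hypothetical statement shape; field names are assumptions."""
    subject_variant: str
    subject_level: str
    consequence: str
    transcript_qualifier: Optional[str] = None


def validation_errors(stmt: MolecularConsequenceStatement) -> List[str]:
    """Return the rule violations for a statement (empty list if valid)."""
    errors = []
    if stmt.subject_level not in ALLOWED_LEVELS:
        errors.append("unknown subject level: " + stmt.subject_level)
    # The conditional rule: a genome-level subject must be contextualized
    # by the transcript in which the consequence occurs.
    if stmt.subject_level == "genomic" and stmt.transcript_qualifier is None:
        errors.append("genomic subject requires a transcript qualifier")
    return errors
```

If the subject were constrained to transcript-level variants, the second check and the qualifier field would both be unnecessary.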

If, however, variant normalization to a specific sequence type is enforced at data creation, we avoid these issues, but place a burden on data creators. We may need to provide some support / service for source data using genome or protein level variants to facilitate/automate the conversion of these to transcript-level representations.

mbrush commented 5 years ago

This issue touches upon the balance between too much flexibility (good for adoption, bad for interoperability) and too little (makes adoption harder but improves out-of-the-box interoperability). I think we need a deeper understanding of the primary use case(s) for the model, how DPs anticipate populating the model with their data, and how much flexibility we need to support in cases like this.

We may want to walk through data examples in our catalog and consider problems that we may encounter if we require transcript-level variants (e.g. ask GeL and CG to walk through their data examples and how they would transform to the standard if we constrain variant type, or create molecular consequence value sets that are too strict).

mbrush commented 5 years ago

Looked at the GeL / Cellbase example here, and noted that one use case for allowing genomic-level variant representation is when the variant is annotated as an upstream or downstream variant (see the TOMM40 annotation). In this case there may be no specific transcript or protein affected (if the statement is just that it is upstream of the gene, as opposed to a specific transcript of the gene), so it could make sense to specify the variant at a genomic level.

But this also raises the question as to whether this is even a molecular consequence annotation, as opposed to an affected feature annotation.

rrfreimuth commented 5 years ago

I like the category definitions that you developed during the review of molecular consequence. I think it may be helpful to recognize that they require different types of information bundled together to make a statement.

Ref my previous comment, where I suggested that these terms require an allele AND a sequence AND a set of annotations.

Variation class: requires a comparison between two sequences (a reference and an alt sequence), each containing a different allele
Affected feature type: requires the location of a specified allele (but not necessarily the allele itself) on a specified sequence with a given set of annotations
Processing consequence: requires a specified reference sequence (containing one allele) that has a given set of annotations, an alt sequence (containing the other allele) with the same set of annotations, and a predicted processed sequence for both the ref and alt sequences
Feature consequence: requires a comparison between two sequences (a reference and an alt sequence), each containing a different allele, with a given set of annotations
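One way to read the four definitions above is as sets of required inputs. The sketch below encodes them that way; the key names are hypothetical, and it simplifies by treating the "specified sequence" in Affected feature type as the reference sequence.

```python
# Required inputs per category, paraphrased from the definitions above.
# All key names are illustrative assumptions, not spec terms.
REQUIRED_INPUTS = {
    "variation_class": {"ref_sequence", "alt_sequence"},
    # Simplification: the "specified sequence" is assumed to be the reference.
    "affected_feature_type": {"allele_location", "ref_sequence", "annotations"},
    "processing_consequence": {
        "ref_sequence",
        "alt_sequence",
        "annotations",
        "processed_ref_sequence",
        "processed_alt_sequence",
    },
    "feature_consequence": {"ref_sequence", "alt_sequence", "annotations"},
}


def assertable_categories(available):
    """Return the categories whose required inputs are all available, sorted by name."""
    have = set(available)
    return sorted(name for name, needed in REQUIRED_INPUTS.items() if needed <= have)
```

Laid out like this, Feature consequence needs everything Variation class needs plus annotations, and Processing consequence adds the predicted processed sequences on top of that.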

I find it is helpful to think about the entities and relationships that would be required to make each of these assertions when trying to decide how they might differ at the model level, but YMMV.

ga4gh / va-spec

Constraining annotation subject to transcript level variants for Molecular Consequence #16