ga4gh / va-spec

An information model for representing variant annotations.

Variant-Level Metadata object proposal #46

Open mbrush opened 4 years ago

mbrush commented 4 years ago

Proposing a new object/structure that would offer a concise structure to present basic variant information that often accompanies variant annotations, but for which there is no dedicated Statement type. For example:

  1. One or more names/labels for the variation (e.g. an HGVS name)
  2. The structural type of the variation, based on extent/structure of the change (e.g. SNV, substitution, insertion, deletion, indel, translocation)
  3. The reference allele occupying the same location as the subject allele (on the same or possibly different reference sequences)
  4. The minor allele occupying the same location as the subject allele, in a particular population
  5. The ancestral allele occupying the same location as the subject allele, in a particular population
  6. The variant taxon, i.e. the species in which the variant is found. Can of course be inferred from the taxon of the affected gene, or the reference sequence for the variation, but some implementers may want to explicitly capture this for users.
  7. Downstream transcript or protein level changes resulting from an upstream variant. (e.g. Val->Ser for protein)
  8. A set of one or more external identifiers/accessions for the subject variation from public/community databases or registries (e.g. the Allele Registry, COSMIC, ClinVar, etc.)
  9. The 'maximal' expansion set for the discrete variation asserted as the subject. In contrast to the set specified in Expansion Set Statements, the 'maximal' set for a discrete variation is always the same and is independent of annotation context. It provides a view of all other representations of the same underlying variation, not just those to which the annotated knowledge applies.

Structurally, this VariantLevelMetadata object (aka "Variant Details", as ClinVar calls an analogous object, or perhaps "VariantAttributes") would be packaged in a VariantAnnotation object alongside the primary Statement (rather than within it, to maintain the atomic character of Statements).
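To make the packaging concrete, here is a minimal Python sketch of how the nine items and the "alongside, not within" arrangement might look. Every field name below is hypothetical and chosen only for illustration; nothing here is defined by the va-spec, and the Statement is stood in for by a plain dict.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VariantLevelMetadata:
    """Hypothetical container for the nine proposed items (names are assumptions)."""
    labels: List[str] = field(default_factory=list)                 # 1. names/labels, e.g. HGVS
    structural_type: Optional[str] = None                           # 2. e.g. "SNV", "deletion"
    reference_allele: Optional[str] = None                          # 3. allele at the same location
    minor_allele: Dict[str, str] = field(default_factory=dict)      # 4. population -> allele
    ancestral_allele: Dict[str, str] = field(default_factory=dict)  # 5. population -> allele
    taxon: Optional[str] = None                                     # 6. species of the variant
    downstream_changes: List[str] = field(default_factory=list)     # 7. transcript/protein changes
    external_ids: List[str] = field(default_factory=list)           # 8. registry accessions
    maximal_expansion_set: List[str] = field(default_factory=list)  # 9. context-independent set


@dataclass
class VariantAnnotation:
    """Packages the primary Statement next to the metadata, not inside it."""
    statement: dict                                        # placeholder for a real Statement
    variant_metadata: Optional[VariantLevelMetadata] = None


annotation = VariantAnnotation(
    statement={"type": "ExampleStatement"},                # illustrative only
    variant_metadata=VariantLevelMetadata(
        labels=["chr17:50198002C>A"],
        structural_type="SNV",
        reference_allele="C",
    ),
)
```

Keeping `variant_metadata` as a sibling of `statement` means the Statement can be extracted and reused on its own without carrying the descriptive fields with it.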

mbrush commented 4 years ago

We don’t necessarily have to limit the content of the VariationMetadata object to information for which there is no dedicated statement type. It could provide an efficient/concise way of bundling supporting information of any kind without the overhead of representing a statement object for each piece of information. For example, the affected gene, molecular consequence, functional impact, population frequency, or pathogenicity interpretation of the subject variation. But we initially propose the split between simpler/foundational VariationMetadata, and Supporting Statements for more nuanced/complex information because:

AmandaSpurdle commented 4 years ago

point 3 - semantics - but can you change SNP to SNV, since SNP is really used in so many ways and provides misconceptions about frequency and/or pathogenicity...

i am not sure what you mean by ancestral allele - is this meaning in the context of an alignment across species? otherwise i would assume it is the reference allele. need to know the definition, or are you trying to capture the instances where the reference transcript happens to include a "rarer" allele?

point 7 - i assume you mean predicted changes on the basis of the genetic sequence. how are you going to deal with variants that have multiple effects? we have seen coding variants that lead to leaky splicing and also to a missense change that alters protein function - and both these effects were predicted bioinformatically

leicray commented 4 years ago

I wonder about whether the 9 points of the object proposal are non-redundant, but perhaps the intention is to have deliberate redundancy to allow for "sanity-checking".

Point 1 defines a "label" for a variant and HGVS is given as an example. I presume that "HGVS" means a complete and valid HGVS-compliant variant description. If so, that description will be sufficient to innately define most of the "structural types" mentioned in point 3. Similarly, the "reference allele" in point 4 will have been defined by the reference sequence and the sequence alteration that are necessary parts of an HGVS-compliant variant description.
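As a rough illustration of that redundancy, the sketch below reads the edit type straight off the text of an HGVS-style description. The function name is hypothetical, the accession `NM_000000.0` is a made-up placeholder, and the substring matching is deliberately naive: it is not a validating HGVS parser and ignores many legal forms.

```python
from typing import Optional

# Checked in order, so "delins" is tested before its substrings "del" and "ins".
_EDIT_MARKERS = [
    ("delins", "deletion-insertion"),
    ("del", "deletion"),
    ("ins", "insertion"),
    ("dup", "duplication"),
    ("inv", "inversion"),
    (">", "substitution"),
]


def hgvs_edit_type(description: str) -> Optional[str]:
    """Guess the edit type from the part of the description after the accession."""
    edit = description.rsplit(":", 1)[-1]   # drop "ACCESSION:" if present
    for marker, edit_type in _EDIT_MARKERS:
        if marker in edit:
            return edit_type
    return None                             # unrecognised or unsupported form


kind = hgvs_edit_type("NM_000000.0:c.100A>G")   # placeholder accession
```

If the structural type is also stored explicitly, a check like this could serve the "sanity-checking" purpose mentioned above by flagging records where the label and the stated type disagree.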

The one thing that I think might be essential, but is missing at present, is the genome build for the "reference allele" in point 4. There are perfectly valid variant labels (e.g. 17-50198002-C-A or chr17:50198002C>A) which might refer to GRCh37 or GRCh38.
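The ambiguity can be shown with a small Python sketch that parses the two label styles above and refuses to return a result unless a genome build is supplied. The function, the tuple type, and the regular expressions are all assumptions made for illustration, and only simple single-chromosome ref/alt labels are handled.

```python
import re
from typing import NamedTuple, Optional


class PositionalVariant(NamedTuple):
    assembly: str      # e.g. "GRCh37" or "GRCh38"; must be given explicitly
    chromosome: str
    position: int
    ref: str
    alt: str


_CHROM = r"(?:chr)?(?P<chrom>\d{1,2}|X|Y|MT?)"
_DASHED = re.compile(_CHROM + r"-(?P<pos>\d+)-(?P<ref>[ACGT]+)-(?P<alt>[ACGT]+)$")
_ARROW = re.compile(_CHROM + r":(?P<pos>\d+)(?P<ref>[ACGT]+)>(?P<alt>[ACGT]+)$")


def parse_positional_label(label: str, assembly: Optional[str] = None) -> PositionalVariant:
    """Parse '17-50198002-C-A' or 'chr17:50198002C>A'; the build is mandatory."""
    if assembly is None:
        raise ValueError(f"{label!r} is ambiguous without a genome build")
    for pattern in (_DASHED, _ARROW):
        match = pattern.match(label)
        if match:
            return PositionalVariant(
                assembly, match["chrom"], int(match["pos"]), match["ref"], match["alt"]
            )
    raise ValueError(f"unrecognised label format: {label!r}")


on_37 = parse_positional_label("17-50198002-C-A", assembly="GRCh37")
on_38 = parse_positional_label("chr17:50198002C>A", assembly="GRCh38")
```

The two results share chromosome, position, and alleles yet compare unequal because the build differs, which is the distinction the label alone cannot express.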