ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Variant annotation conceptual structure discussion #214

Closed stevenbrenner closed 8 years ago

stevenbrenner commented 9 years ago

This thread is to discuss the conceptual structure and scope of variant annotation schemas.

Currently, methods like VEP, SnpEff, Annovar, and Varant provide annotation in the VCF file in ways that embody historical legacy, are constrained by the VCF format, and are not broadly extensible. The purpose of this discussion is to consider the conceptual nature of these annotations, and the degree of complexity we want them to embody (for example, how much locus and haplotype information), before resolving to a detailed schema.

kellrott commented 9 years ago

I'm interested to see where this discussion goes. For the Genotype2Phenotype schema ( https://github.com/ga4gh/schemas/issues/196 ), having high-level descriptions like 'TP53 mutation' would allow us to link phenotypes to variants much more easily. This is slightly different from a straight variant annotation, in that there is no concrete sample and variant call linked to the variant 'concept' (i.e. we talk about 'TP53 mutation' as a biological concept without actually having to point to a specific caller that was run on a particular sample). But that data structure should be comparable to a variant call that was generated (i.e. 'TP53 mutation in sample_10 as determined by MuTect').
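A minimal sketch of that distinction, assuming a gene symbol is enough to compare the two. The class names, fields, and the position value are hypothetical placeholders for illustration and are not taken from any GA4GH schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantConcept:
    """A sample-independent idea such as 'TP53 mutation' (hypothetical type)."""
    gene: str
    description: str


@dataclass(frozen=True)
class VariantCall:
    """A concrete observation made by one caller on one sample (hypothetical type)."""
    sample_id: str
    caller: str
    chrom: str
    position: int
    ref: str
    alt: str
    gene: Optional[str] = None  # gene assigned by some upstream annotation step


def call_matches_concept(call: VariantCall, concept: VariantConcept) -> bool:
    """Assumed comparison rule: a call is an instance of a concept if the genes agree."""
    return call.gene is not None and call.gene == concept.gene


concept = VariantConcept(gene="TP53", description="TP53 mutation")
call = VariantCall(
    sample_id="sample_10",
    caller="MuTect",
    chrom="17",
    position=1_000_000,  # placeholder coordinate, not a real TP53 position
    ref="C",
    alt="T",
    gene="TP53",
)
print(call_matches_concept(call, concept))
```

The concept carries no sample or caller, so it can be linked to a phenotype on its own, while the call can still be tested against it.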

jacmarjorie commented 9 years ago

Here are some suggestions based on a clinical decision support db:

Therapeutic variants

iskandr commented 9 years ago

The word "annotation" seems awfully overloaded in the context of genomics. Is this thread talking specifically about immediate effects, such as "deletion of the first two intronic residues probably messes with splicing"? Or does it also include higher-level annotations, such as "disruption of this particular binding domain in TP53 is known to be oncogenic"?

Some suggestions for the modest former case:

1) It's only meaningful to talk about a genomic variant's effect on a particular transcript. The existing conventions of summarizing a variant by its most deleterious effect on any transcript or its effect on the longest transcript are both crude rules of thumb. It's probably desirable to preserve all the transcript-specific effects, even at the cost of moving away from stuffing effect annotations into a VCF.

2) Annotation programs don't typically make it easy to determine the altered protein sequence for variants other than SNVs (e.g. "V348fs" doesn't tell me much). It would be very useful (particularly for immunoinformatics) to preserve inferred protein sequences for frameshift and stop-loss variants. Additionally, if translation start site or splicing prediction programs show sufficiently high accuracy, then getting protein sequences for start-loss and splicing variants would also be nice.
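To make both points concrete, here is a toy sketch of a per-transcript effect record that keeps the inferred protein sequence. It assumes the variant has already been mapped to an offset within each transcript's coding sequence (or to `None` if it falls outside the CDS). `TranscriptEffect`, `annotate_transcript`, the effect labels, the transcript IDs, and the sequences are all made up for illustration; this is not how any existing annotator or the GA4GH schema represents effects.

```python
from dataclasses import dataclass
from typing import Optional

# Standard genetic code, indexed by the conventional T, C, A, G base ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate an uppercase CDS until the first stop codon or the last full codon."""
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


@dataclass(frozen=True)
class TranscriptEffect:
    """One variant's effect on one transcript (hypothetical record)."""
    transcript_id: str
    effect: str
    ref_protein: str
    alt_protein: str


def annotate_transcript(transcript_id: str, cds: str, offset: Optional[int],
                        ref: str, alt: str) -> TranscriptEffect:
    """Apply ref->alt at a 0-based CDS offset and keep the full mutant protein."""
    ref_protein = translate(cds)
    if offset is None:  # variant does not overlap this transcript's CDS
        return TranscriptEffect(transcript_id, "non_coding", ref_protein, ref_protein)
    if cds[offset:offset + len(ref)] != ref:
        raise ValueError("ref allele does not match the CDS at this offset")
    alt_cds = cds[:offset] + alt + cds[offset + len(ref):]
    alt_protein = translate(alt_cds)
    if (len(alt) - len(ref)) % 3 != 0:
        effect = "frameshift"
    elif len(alt) != len(ref):
        effect = "inframe_indel"
    elif alt_protein == ref_protein:
        effect = "synonymous"
    else:
        effect = "protein_altering"
    return TranscriptEffect(transcript_id, effect, ref_protein, alt_protein)


# The same single-base deletion, reported separately for two toy transcripts.
effects = [
    annotate_transcript("tx_coding", "ATGGCCAAGTTTGGATAA", 4, "C", ""),
    annotate_transcript("tx_outside_cds", "ATGAAATGA", None, "C", ""),
]
for effect in effects:
    print(effect)
```

For the first toy transcript the record keeps the whole shifted sequence (MAKFG becomes MASLD) rather than only a label, and the second transcript gets its own entry instead of being collapsed into a single "most deleterious" summary.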

diekhans commented 8 years ago

Variant annotations have been accepted into the schema; I believe this issue is resolved.

iskandr commented 8 years ago

@diekhans Link to the variant annotation schema?

diekhans commented 8 years ago

schema is:

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/alleleAnnotations.avdl

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/alleleAnnotationmethods.avdl

mbaudis commented 8 years ago

(deleted wrong note about id use)

sarahhunt commented 8 years ago

Hi @mbaudis! analysisId is a reference to the Analysis record in metadata. The AnalysisResult does not have an id of its own.

mbaudis commented 8 years ago

@sarahhunt Oh ... I got confused (which may not be a good sign, either for my attention span or the documentation...). I'll delete my comment.

sarahhunt commented 8 years ago

@mbaudis - all feedback is useful. I'll have a look at the documentation. We have a nice diagram here -

http://ga4gh-schemas.readthedocs.org/en/latest/api/alleleAnnotations.html

but it's not very discoverable.