ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

How do we handle hard cases of variant naming (with examples)? #333

Open reece opened 9 years ago

reece commented 9 years ago

This issue was created in response to a sub-thread in #312, excerpted below. The goal of the issue is to identify cases of hard-to-name variants and potential solutions.

@awz wrote:

Can CSN handle very famous variants such as the causative mutation of sickle cell anemia (when homozygous) - HBB E6V?

http://ghr.nlm.nih.gov/gene/HBB "Specifically, the amino acid glutamic acid is replaced with the amino acid valine at position 6 in beta-globin, written as Glu6Val or E6V."

versus

http://www.ncbi.nlm.nih.gov/clinvar/RCV000016573 http://www.uniprot.org/uniprot/P68871 "NM_000518.4(HBB):c.20A>T (p.Glu7Val)" The preferred name in ClinVar.

N.B. that prior to cleavage of the initiator methionine HBB E6V would actually be HBB E7V.

Should we collect a set of clinically important but hard to name variants?

And a reply from @fcunningham:

Great - we have a folder for hard to annotate variants - please do contribute clinically important but hard to name variants here: https://drive.google.com/folderview?id=0B6jIo0eTEQxrfmJPSF9hQ1pscWNJVUc5bldCVWpGOVd1QzJORklxOTJLVnE3d2pkWmt2N2c&usp=sharing

EliseRuark commented 9 years ago

I might be missing something; is there a description anywhere of why each variant is difficult, and some indication of what we would like the annotation to capture?

jzook commented 9 years ago

The variants in the folder Fiona mentioned are cases I compiled from NA12878 in which multiple indels fall within exons and can be represented in VCF in different ways, which poses a particular challenge for annotation. They are only one type of hard case, though.

Cheers, Justin

lh3 commented 9 years ago

@awz's example is not one of the complex variants that Fiona/Justin's table is meant to address. It is an inconsistency between protein and CDS coordinates caused by post-translational modification, which could be common, since Met cleavage does not seem to be rare in humans.
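The offset can be made concrete with a little arithmetic. The following is only a sketch: the function names are hypothetical, and it assumes the sole processing step is removal of the initiator methionine, which in practice has to be known from protein-level evidence.

```python
def codon_number(cds_pos: int) -> int:
    """Return the 1-based codon index (precursor numbering) for a 1-based CDS position."""
    if cds_pos < 1:
        raise ValueError("CDS positions are 1-based")
    return (cds_pos - 1) // 3 + 1


def mature_position(precursor_pos: int, met_cleaved: bool = True) -> int:
    """Convert precursor numbering to mature-protein numbering.

    Assumption: the only difference is the cleaved initiator Met (offset of 1).
    """
    if not met_cleaved:
        return precursor_pos
    if precursor_pos < 2:
        raise ValueError("position 1 is the cleaved Met and has no mature position")
    return precursor_pos - 1


codon = codon_number(20)               # c.20A>T falls in codon 7
print(codon, mature_position(codon))   # 7 6
```

With these assumptions, c.20 maps to residue 7 of the precursor (p.Glu7Val) and residue 6 of the mature protein (E6V), which is exactly the discrepancy between the two names.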

awz commented 9 years ago

@lh3 thanks for clarifying. It is a bit frustrating that this example is both so important and, potentially, quite common as you say.

Maybe we need a higher-level document that describes the various "buckets" that hard-to-name variants fall into. Then we could make a separate collection of representative examples for each bucket.

pgrosu commented 9 years ago

Let me pose a question: what would the ideal annotation of such complex/multi-domain cases look like? Would you like to group multiple variants into one annotation? If the order is not important, we could generate regular-expression rule sets for any number of combinations, including regions that map back to the same annotation. The results could then be transmitted in a form similar to the one used in proteomics:

http://wiki.thegpm.org/wiki/Nomenclature_for_the_description_of_protein_sequence_modifications

In fact, we could even hash (SHA-256) an annotation group and transmit only the hash, to minimize the amount of data sent over the wire, and then match it at the receiving end against a dictionary to expand it back into the full annotation.
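A minimal sketch of that idea, using only Python's standard `hashlib` and `json` modules. The `group_digest` function and `AnnotationDictionary` class are hypothetical names for illustration, and the sketch assumes order within a group does not matter and that both ends already hold the same dictionary.

```python
import hashlib
import json


def group_digest(annotations):
    """SHA-256 hex digest of an annotation group, independent of member order."""
    # Sorting gives a canonical form, so the same set always hashes the same way.
    canonical = json.dumps(sorted(annotations), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class AnnotationDictionary:
    """Hypothetical lookup table that sender and receiver must both hold."""

    def __init__(self):
        self._by_digest = {}

    def register(self, annotations):
        digest = group_digest(annotations)
        self._by_digest[digest] = sorted(annotations)
        return digest

    def expand(self, digest):
        # Raises KeyError if the group was never registered on this side.
        return self._by_digest[digest]


shared = AnnotationDictionary()
key = shared.register(["NM_000518.4:p.(Glu7Val)", "NM_000518.4:c.20A>T"])
print(key)                 # 64 hex characters, sent instead of the full group
print(shared.expand(key))  # full annotation group recovered at the receiver
```

The saving only materializes when groups are reused often; a digest the receiver has never seen cannot be expanded, so the dictionary itself still has to be distributed somehow.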

Basically, what I'm asking is this: take a group and write out how you would ideally like to annotate it, in full form without any compression, even if it has multiple annotations, since an annotation group/set is possible.

Thanks, Paul

pcingola commented 9 years ago

If I understood correctly, the HGVS "p." notation's page says:

...descriptions at protein level should describe the changes observed on protein level and 
not try to incorporate any knowledge regarding the change at DNA-level 

nevertheless, if there is no protein level evidence:

...to indicate that the description at protein level is without any experimental evidence it is 
recommended that, when RNA nor protein has been analysed, the description is given 
between brackets, like p.(Arg22Ser)

So, to answer the question,

i) NM_000518.4(HBB):c.20A>T (p.Glu7Val) would be used to indicate protein change calculated from DNA evidence.

ii) Glu6Val would be used to indicate "protein level evidence" (unless there is a reliable way to predict Met cleavage using DNA / RNA sequence alone). Do you know of any such algorithm?

In my opinion, since in sequencing experiments we don't have protein level evidence, the first way of expressing the variant would be preferred.
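The bracket convention quoted above can be captured in a few lines. This is only an illustrative sketch; `protein_description` is a hypothetical helper, not part of any HGVS library, and it handles simple substitutions only.

```python
def protein_description(ref: str, pos: int, alt: str, observed: bool = False) -> str:
    """Format a simple protein-level substitution.

    Following the quoted recommendation, a change predicted from DNA alone
    (no RNA or protein analysed) is wrapped in parentheses.
    """
    core = f"{ref}{pos}{alt}"
    return f"p.{core}" if observed else f"p.({core})"


# Predicted from sequencing data only: parentheses mark the lack of protein evidence.
print(protein_description("Glu", 7, "Val"))                 # p.(Glu7Val)
# Confirmed at the protein level.
print(protein_description("Glu", 7, "Val", observed=True))  # p.Glu7Val
```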