Open reece opened 9 years ago
I might be missing something; is there a description anywhere of why each variant is difficult, and some indication of what we would like the annotation to capture?
The variants in the folder Fiona mentioned are cases I compiled from NA12878 in which multiple indels fall in exons and can be represented in VCF in different ways, which poses a particular challenge for annotation. These are only one type of hard case, though.
Cheers, Justin
On Mon, Jun 15, 2015 at 10:37 AM EliseRuark notifications@github.com wrote:
I might be missing something; is there a description anywhere of why each variant is difficult, and some indication of what we would like the annotation to capture?
@awz's example is not one of the complex variants that Fiona/Justin's table is intended to address. It is an inconsistency between protein and CDS coordinates caused by post-translational modification, which could be common, as Met cleavage does not seem to be rare in humans.
@lh3 thanks for clarifying. It is a bit frustrating that this example is both so important and, potentially, quite common as you say.
Maybe we need a higher-level document that describes the various "buckets" that hard-to-name variants fall into. Then we could make a separate collection of representative examples for each bucket.
Let me pose this question: what would be your ideal way of annotating such complex/multi-domain cases? Would you like to group multiple variants into one annotation? If the order is not important, then we can generate regular-expression rule sets for any number of combinations, including regions that map back to the same annotation. The results can then be transmitted in a way similar to the one used in proteomics:
http://wiki.thegpm.org/wiki/Nomenclature_for_the_description_of_protein_sequence_modifications
In fact, we can even hash (SHA256) an annotation group and transmit only the hash, to minimize the amount of data sent over the wire, and then match it at the receiving end against a dictionary to expand it back into the full annotation.
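As a rough sketch of that idea (not an agreed design), the snippet below hashes an order-independent annotation group with SHA-256 and expands the digest again from a shared dictionary. The names `canonical_form`, `group_digest`, and `AnnotationDictionary` are hypothetical, the newline-joined canonical form is an assumption, and the two HBB strings are only illustrative group members.

```python
import hashlib


def canonical_form(annotations):
    # Assumption: order within a group does not matter, so sort and
    # de-duplicate before hashing to get one stable representation.
    return "\n".join(sorted(set(annotations)))


def group_digest(annotations):
    # SHA-256 hex digest of the canonical form (64 hex characters).
    return hashlib.sha256(canonical_form(annotations).encode("utf-8")).hexdigest()


class AnnotationDictionary:
    """Hypothetical lookup table that both sender and receiver would hold."""

    def __init__(self):
        self._groups = {}

    def register(self, annotations):
        # Store the full group under its digest and return the digest,
        # which is the only thing that would be sent over the wire.
        digest = group_digest(annotations)
        self._groups[digest] = sorted(set(annotations))
        return digest

    def expand(self, digest):
        # Raises KeyError if the receiver has never seen this group.
        return self._groups[digest]


# Illustrative group; the strings are examples, not a curated hard case.
group = ["NM_000518.4(HBB):c.20A>T", "NM_000518.4(HBB):c.19G>A"]
shared = AnnotationDictionary()
digest = shared.register(group)
full_annotation = shared.expand(digest)
```

One consequence of this approach is that the receiver can only expand digests for groups already present in its dictionary, so the dictionary itself still has to be distributed by some other means.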
Basically what I'm asking is take a group and just write how you would ideally like to annotate it - in full form without any compression - even if it has multiple annotations, since an annotation group/set is possible.
Thanks, Paul
If I understood correctly, the HGVS "p." notation's page says:
"...descriptions at protein level should describe the changes observed on protein level and not try to incorporate any knowledge regarding the change at DNA-level"
Nevertheless, if there is no protein-level evidence:
"...to indicate that the description at protein level is without any experimental evidence it is recommended that, when RNA nor protein has been analysed, the description is given between brackets, like p.(Arg22Ser)"
So, to answer the question,
i) NM_000518.4(HBB):c.20A>T (p.Glu7Val) would be used to indicate protein change calculated from DNA evidence.
ii) Glu6Val would be used to indicate "protein level evidence" (unless there is a reliable way to predict Met cleavage using DNA / RNA sequence alone). Do you know of any such algorithm?
In my opinion, since in sequencing experiments we don't have protein level evidence, the first way of expressing the variant would be preferred.
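To make the two numberings concrete, here is a minimal sketch of the coordinate arithmetic involved. The helper names (`codon_number`, `as_predicted`, `mature_numbering`) are hypothetical, and the code assumes a simple three-letter substitution and that exactly the initiator Met is cleaved. It does not predict cleavage; it only shows how p.Glu7Val and Glu6Val relate.

```python
import re

# Matches simple three-letter substitutions such as p.Glu7Val or p.(Glu7Val).
_SUBSTITUTION = re.compile(r"(?:p\.)?\(?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)?")


def codon_number(cds_position):
    # 1-based CDS position -> 1-based codon number (c.1-3 is codon 1, the Met).
    return (cds_position - 1) // 3 + 1


def _parse(p_change):
    match = _SUBSTITUTION.fullmatch(p_change)
    if match is None:
        raise ValueError(f"not a simple substitution: {p_change!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def as_predicted(p_change):
    # Parenthesised form for a consequence inferred from DNA only.
    ref, pos, alt = _parse(p_change)
    return f"p.({ref}{pos}{alt})"


def mature_numbering(p_change, cleaved_residues=1):
    # Assumption: the first `cleaved_residues` residues (here the initiator
    # Met) are absent from the mature protein, shifting the numbering down.
    ref, pos, alt = _parse(p_change)
    if pos - cleaved_residues < 1:
        raise ValueError("residue lies within the cleaved region")
    return f"{ref}{pos - cleaved_residues}{alt}"


affected_codon = codon_number(20)            # c.20 falls in codon 7
predicted = as_predicted("p.Glu7Val")        # "p.(Glu7Val)"
protein_level = mature_numbering("p.Glu7Val")  # "Glu6Val"
```

Whether a tool should ever emit the shifted form automatically is exactly the open question above; the sketch only makes the offset explicit.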
This issue was created in response to a sub-thread in #312, excerpted below. The goal of the issue is to identify cases of hard-to-name variants and potential solutions.
@aws wrote:
And a reply from @fcunningham: