clingen-data-model / allele

Documentation for data model of ClinGen

Options for primaryNucleotideChangeType and primaryAminoAcidChangeType #123

Closed pmcgarvey closed 3 years ago

pmcgarvey commented 9 years ago

Hi

I am trying to fit some protein AA variant data into the model and Baylor's documents. I can make a protein simple allele document from our data, but I am not certain I can always give the correct primaryAminoAcidChangeType.

I can make a corresponding genomic simple allele with coordinates for the affected codon, but I do not know the actual nucleotide change.

Can we have an "unknown" option included in the enumeration for these attributes?

Thanks

Peter McGarvey

srynobio commented 9 years ago

This is something we can consider. The issue with creating an unknown type is that I fear no one will do the legwork needed to correctly annotate SimpleAlleles. Have you considered using some of the recommended annotation tools?

pmcgarvey commented 9 years ago

I understand not wanting an 'unknown'. I would then have to leave it empty for nucleotides.

I have thousands of disease-associated AA sequence variations curated from the literature, but because the genetic code is redundant, no tool can determine which of the several possible DNA changes actually occurred. We have a protein sequence database, not a DNA one. Someone might have to read the paper again, which I hope will happen (I will do a few myself), but it is still useful data.

Thanks

larrybabb commented 9 years ago

Hi Peter, It would be great if you could provide one or two examples of the use case you are hitting. I think I understand, but an example or two would help me to understand the justification immensely. Plus, we will use it in our documentation if we end up adding the "unknown" type as suggested. Thanks.

pmcgarvey commented 9 years ago

Larry

I will. Another day though. Mine might be unique but still useful. Have a phone meeting to attend now.

Peter

pmcgarvey commented 9 years ago

OK, here is a longer description of our use case and issues. I hope it is understandable.

Protein Alleles

I have a set of ~23,000 human curated disease-associated protein variants for 2,120 proteins from UniProt.

I can generate a protein allele document with the AA change from the data I have with associated information and evidence.

I would also like to create an hgvs-genomic allele with these genome positions, but it would have some ambiguity codes in the name. For example, it might look like this: NC_000023.11:g:101398904_101398907XXX>XXX.

The reason is that I know the positions of the 3-base codons for the AA on the transcript/genome, but I do not know the exact bases changed in the DNA, as there are multiple possibilities in the redundant genetic code. The publication from which the AA change was curated might have the nucleotide change information, plus pedigree information and more, but extracting that from the text and figures would be manual effort, so we will not be doing it for all of them anytime soon.

Why bother with a genomic allele? Because the genome position information becomes important if it aligns with other functional features, like an enzyme active site or similar. We have cases where these variants do align with protein active sites or structural features like disulfide bonds, providing a clear functional mechanism for a variant's effect. Such overlaps, where they exist, are specified as an evidence level for pathogenicity in the ACMG guidelines and can be used in the pathogenicity calculator.
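To make the ambiguity concrete, here is a minimal Python sketch that lists the codon changes consistent with a given amino acid change under the standard genetic code. The function name `candidate_codon_changes` and its `max_substitutions` parameter are hypothetical helpers for illustration only; they are not part of the ClinGen model or of any annotation tool.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def candidate_codon_changes(ref_aa, alt_aa, max_substitutions=1):
    """Return (ref_codon, alt_codon) pairs that could explain ref_aa -> alt_aa.

    Only pairs differing at between 1 and max_substitutions positions are kept.
    Amino acids are given as one-letter codes (hypothetical interface).
    """
    ref_codons = [c for c, aa in CODON_TABLE.items() if aa == ref_aa]
    alt_codons = [c for c, aa in CODON_TABLE.items() if aa == alt_aa]
    pairs = []
    for ref in ref_codons:
        for alt in alt_codons:
            n_diff = sum(r != a for r, a in zip(ref, alt))
            if 1 <= n_diff <= max_substitutions:
                pairs.append((ref, alt))
    return pairs


# Glu -> Lys (E -> K), allowing a single-base substitution.
print(candidate_codon_changes("E", "K"))
# [('GAA', 'AAA'), ('GAG', 'AAG')]
```

Even when only single-base substitutions are considered, the protein-level change alone does not say which reference codon was present, so the reference sequence at the codon is needed to narrow the list.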

Our issues with the current model are:

1) For primaryNucleotideChangeType I do not know the change. It is probably a substitution in almost all cases, but in a few cases it could be something different. This is a minor problem; I guess I might have to leave it empty in some cases unless there is an unknown option, which could be abused by others.

2) There are cases in our data where the 3-base codon is interrupted by an intron, so a single start and stop position will not accurately specify the genome location. For example:

UniProtAccession, GeneName, UniProtVARID, AlleleName, AssociatedDisease
Q14524, SCN5A, VAR_026344, Q14524:p.Glu161Lys, Brugada syndrome 1 (BRGDA1) [MIM:601144]

The codon for the reference glutamic acid at this position on the transcript is GAG (tcGAGtac); however, on the genome there is a ~1,430-base intron between the A and G (tcGA…intron…Gtc). So instead of specifying this as genomic start = 38622399, end = 38620971, we feel a list would be more accurate, for example: Start = 38622399, Stop = 38622401; Start = 38620970, Stop = 38620971.

We have other cases like this. I expect 99% of other folks' data will not need this list option, but a list should not interfere with anything on their end.
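The list-of-segments idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `codon_genomic_segments` is hypothetical, exons are given as inclusive (start, end) pairs of coding sequence on the forward strand, and the coordinates in the usage lines are made up rather than taken from SCN5A.

```python
def codon_genomic_segments(exons, codon_index):
    """Map a 1-based codon index to a list of genomic (start, stop) segments.

    exons: coding exons as inclusive (start, end) pairs, in transcript order,
    forward strand only (an assumption of this sketch).
    """
    first = (codon_index - 1) * 3  # 0-based CDS offset of the codon's first base
    positions = []
    offset = 0  # CDS offset of the current exon's first base
    for start, end in exons:
        length = end - start + 1
        # Collect the part of the codon that falls inside this exon.
        for cds_off in range(max(first, offset), min(first + 3, offset + length)):
            positions.append(start + (cds_off - offset))
        offset += length
    if len(positions) != 3:
        raise ValueError("codon lies outside the supplied exons")

    # Merge adjacent positions into contiguous segments.
    segments = []
    for pos in positions:
        if segments and pos == segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], pos)
        else:
            segments.append((pos, pos))
    return segments


# Made-up exons: the second codon straddles the intron between them.
exons = [(100, 104), (1535, 1540)]
print(codon_genomic_segments(exons, 1))  # [(100, 102)]
print(codon_genomic_segments(exons, 2))  # [(103, 104), (1535, 1535)]
```

A codon that sits entirely within one exon yields a one-element list, so consumers that only ever see contiguous locations would not need to change much to accept the list form.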

srynobio commented 9 years ago

I understand your issue and I think this is the solution:

If you review our resource model for SimpleAllele you will see that

SimpleAllele.referenceCoordinate.primaryTranscriptRegionType

  Definition: One of the set of allowable primary-transcript-region-types
  Type Code 
  Control  0..1

Which means this is not a required field (although preferred).

What I would do is use ancillary-transcript-region-type, which would allow you to use the SO term sequence_variant or any child of sequence_variant, which seems correct for your data.

Let me know if this works, and your results.

pmcgarvey commented 9 years ago

Thanks, that solves item #1 in my description above and the title of this thread.

There still is item #2 for a genomic allele.

2) There are cases in our data where the 3 base codon is interrupted by an intron so a single start and stop position will not accurately specify the genome location.

Thanks

cbizon commented 9 years ago

Hi Peter,

I've been out of town for a while, and am just getting caught back up. I apologize for perhaps taking the discussion backwards, but I wonder if I can jump in...

So if I understand correctly, you have an amino acid change that you want to put into the model. You don't know how to correctly specify the nucleotide change that led to that amino acid change. This might be for a number of reasons, either degeneracy in the codons, or because of intronic issues and so on. (If I've misunderstood, please let me know).

If I've got that characterized correctly, then I think that the important thing to point out is that the model does not require the nucleotide change. In fact, the model considers the amino acid allele to be a completely separate entity from the nucleotide allele. (This is basically for the reasons you mentioned: there's not a simple 1 to 1 relationship between them). The model does allow an amino acid allele to be associated to a nucleotide allele, but that association is not an assertion that the particular nucleotide allele was ever observed - simply that if the nucleotide allele did occur, and without anything funny happening, then the amino acid allele would be the result.

So, my suggestion would be to

1) just create amino acid alleles, and not worry about the nucleotide alleles, or
2) create amino acid alleles, also create some or all of the possible underlying nucleotide alleles, and associate them with the amino acid allele.

As for the case where an intron splits a codon, the difference between a nucleotide allele and an amino acid allele also helps, I think. The amino acid allele doesn't have genomic coordinates; it only has coordinates with respect to the protein sequence. One could also create a nucleotide allele: assuming that the Glu->Lys is because of the codon going from GAG to AAG, then the nucleotide allele would include only the SNP G>A at position 38622399.
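As a rough illustration of keeping the two kinds of allele separate while still linking them, here is a small Python sketch. The class and field names (`AminoAcidAllele`, `NucleotideAllele`, `candidate_nucleotide_alleles`) are hypothetical and are not the actual ClinGen resource definitions; the example values are the ones mentioned in this thread.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NucleotideAllele:
    """A nucleotide-level change (hypothetical, simplified structure)."""
    position: int
    ref: str
    alt: str


@dataclass
class AminoAcidAllele:
    """A protein-level change; it carries no genomic coordinates of its own."""
    name: str
    # Nucleotide alleles that WOULD produce this protein change if they
    # occurred. Listing one here is not a claim that it was ever observed.
    candidate_nucleotide_alleles: list = field(default_factory=list)

    def associate(self, allele: NucleotideAllele) -> None:
        if allele not in self.candidate_nucleotide_alleles:
            self.candidate_nucleotide_alleles.append(allele)


# Option 1: the amino acid allele alone, with no nucleotide alleles attached.
protein_only = AminoAcidAllele(name="Q14524:p.Glu161Lys")

# Option 2: the same allele with one possible underlying nucleotide allele,
# assuming the codon changed from GAG to AAG as discussed above.
linked = AminoAcidAllele(name="Q14524:p.Glu161Lys")
linked.associate(NucleotideAllele(position=38622399, ref="G", alt="A"))
print(linked.candidate_nucleotide_alleles)
```

Because the association is a plain list, an amino acid allele with unknown DNA-level cause simply has an empty list, and no "unknown" nucleotide change type is needed in this sketch.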

Let me know if this makes any sense or not...

Chris