clingen-data-model / allele

Documentation for data model of ClinGen

Canonical Allele ClinVar Identifier #72

Closed cbizon closed 3 years ago

cbizon commented 9 years ago

In CA402, there is an identifier like this:

  "identifier": 
  [
    {
      "label": "NM_007294.3(BRCA1):c.5297T>G (p.Ile1766Ser)",
      "system": "http://www.ncbi.nlm.nih.gov/clinvar/variation",
      "value": "37656"
    }
  ],

But this is maybe not quite right. The ClinVar version of the canonical allele includes both nucleotide and amino acid versions, while our Canonical Allele is strictly nucleotide. So the two are not identical.

I would be tempted to do something like this:

  "identifier": 
  [
    {
      "label": "NM_007294.3(BRCA1):c.5297T>G (p.Ile1766Ser)",
      "system": "http://www.ncbi.nlm.nih.gov/clinvar/variation",
      "value": "37656#nucleotide"
    }
  ],

It means that the full url is no longer dereferenceable, but I don't think that's required. Other opinions?
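
For illustration, here is a minimal Python sketch of what the suffixed value does to the full IRI. It assumes system and value are joined with a single slash, which is a convention chosen for this example and not something the model specifies. `identifier_iri` is a hypothetical helper.

```python
from urllib.parse import urldefrag


def identifier_iri(identifier: dict) -> str:
    """Join system and value with one slash (assumed convention, not from the model)."""
    return identifier["system"].rstrip("/") + "/" + identifier["value"]


identifier = {
    "label": "NM_007294.3(BRCA1):c.5297T>G (p.Ile1766Ser)",
    "system": "http://www.ncbi.nlm.nih.gov/clinvar/variation",
    "value": "37656#nucleotide",
}

iri = identifier_iri(identifier)

# urldefrag splits the IRI into the part before "#" and the fragment after it.
base, fragment = urldefrag(iri)
```

Under that assumption the qualifier ends up in the fragment, so everything before the `#` is still the plain ClinVar variation URL.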

srynobio commented 9 years ago

FHIR spec:

Identifier.system : URI

Identifier.value : string. The portion of the identifier typically displayed to the user and which is unique within the context of the system.

Would this keep dereferenceability and still fit the need?

 "identifier": 
  [
    {
      "label": "NM_007294.3(BRCA1):c.5297T>G (p.Ile1766Ser)",
      "system": "http://www.ncbi.nlm.nih.gov/clinvar/variation/37656",
      "value": "nucleotide"
    }
  ],

larrybabb commented 9 years ago

My opinion is that we either don't use ClinVar identifiers as canonical identifiers (I don't really like that, but as you said, they are not the same).

Or

We establish the following guideline when dealing with ClinVar alleles. The allele registry canonicalization rules will use ClinVar allele ids to identify canonicalAlleles under the following conditions:
1. If the ClinVar allele has any nucleotide representations, then the ClinVar allele will represent only a CanonicalNucleotideAllele.
2. If the ClinVar allele has no nucleotide representations (only amino acid representations), then and only then will the ClinVar allele id be used as a CanonicalAminoAcidAllele identifier.

NOTE: even though clinvar embeds amino acid/protein representations alongside the nucleotide representations, this should be considered a denormalization or "convenience" particular to clinvar and should not be misconstrued as a canonicalization of the protein allele itself.

The Allele Registry can still provide the associations and generate the canonical amino acid alleles (generating new identifiers for those that have no independent clinvar identifier).

Just because ClinVar includes the protein data as attributes in the nucleotide alleles does not mean that we should consider them equivalent to amino acid alleles.

Let the lowest level of representation define the clinvar identifier!
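
The two conditions in this guideline can be sketched as a small Python function. This is only an illustration: the function name and the `"nucleotide"` / `"aminoacid"` level labels are assumptions, and how the representation levels of a ClinVar allele would actually be obtained is left out.

```python
from typing import Iterable, Optional

# Hypothetical labels for the level of each representation on a ClinVar allele.
NUCLEOTIDE = "nucleotide"
AMINO_ACID = "aminoacid"


def canonical_type_for_clinvar_allele(representation_levels: Iterable[str]) -> Optional[str]:
    """Decide which canonical allele type a ClinVar allele id may identify."""
    levels = set(representation_levels)
    # Condition 1: any nucleotide representation wins, even if protein data is present.
    if NUCLEOTIDE in levels:
        return "CanonicalNucleotideAllele"
    # Condition 2: only when there is no nucleotide representation at all.
    if AMINO_ACID in levels:
        return "CanonicalAminoAcidAllele"
    # No usable representation: the ClinVar id is not used.
    return None
```

For example, a ClinVar allele carrying both kinds of representation would map to `CanonicalNucleotideAllele` under this rule.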

cbizon commented 9 years ago

Options:

1) Convince Clinvar to provide ids for nucleotide and amino acid alleles. Would help short-term, but we're going to run into issues with this in the future, since we won't convince everybody in every database to do this.

2) Only allow clinvar ID in the cases where it is unambiguous, and lose clinvar info otherwise. Not very satisfying, and runs into trouble if e.g. nucleotide info is added to a previously amino acid only allele at a later date.

3) Use IDs that point to ClinVar but are not ClinVar ids, like appending "#nucleotide" or "#aminoacid" to the ClinVar id. The IRI is no longer dereferenceable (legal, but perhaps bad form), and there's the probable confusion that arises because the value was not actually issued by the ClinVar system.

4) Use a different system, with an IRI that we control, to define IDs in terms of ClinVar:

"identifier": 
  [
    {
      "label": "NM_007294.3(BRCA1):c.5297T>G (p.Ile1766Ser)",
      "system": "http://clingen.org/mapping/clinvar/",
      "value": "37656#nucleotide"
    }
  ],

This solves the issues with option 3 because we can make it dereferenceable, with a link to the real ClinVar page and an explanation, but it is kind of clunky.

5) Like #2, don't use ClinVar IDs except when unambiguous, but define a relatedIdentifiers method or something similar to point to the ClinVar record without claiming identity.

6) Like #2, don't use ClinVar IDs except when unambiguous, but use provenance to relate the allele to a ClinVar allele from which it was derived. This maybe only works if the allele comes from ClinVar originally?

7) Redefine the system to include the ClinVar allele, and put only the nucleotide / amino acid qualifier in the value. The system remains dereferenceable, but it's not really the "system" as we would understand it, I think. And the IRI made from system+value is still not dereferenceable.
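
As a rough illustration of option 4, the sketch below shows what dereferencing a ClinGen-controlled mapping IRI could return: a link to the real ClinVar page plus an explanation. The function name, the returned keys, and the explanation text are all hypothetical. Only the two system URLs and the `#nucleotide` / `#aminoacid` suffixes come from this thread.

```python
CLINVAR_VARIATION = "http://www.ncbi.nlm.nih.gov/clinvar/variation"
MAPPING_SYSTEM = "http://clingen.org/mapping/clinvar/"
LEVELS = {"nucleotide", "aminoacid"}


def describe_mapping(value: str) -> dict:
    """Build the (hypothetical) document served for a mapping identifier value."""
    clinvar_id, sep, level = value.partition("#")
    # Reject values that are not "<numeric ClinVar id>#<known level>".
    if not sep or not clinvar_id.isdigit() or level not in LEVELS:
        raise ValueError(f"not a ClinVar mapping value: {value!r}")
    return {
        "iri": MAPPING_SYSTEM + value,
        "clinvarUrl": f"{CLINVAR_VARIATION}/{clinvar_id}",
        "level": level,
        "explanation": (
            f"Refers only to the {level} allele(s) grouped under "
            f"ClinVar variation {clinvar_id}; this id was not issued by ClinVar."
        ),
    }


mapping = describe_mapping("37656#nucleotide")
```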

larrybabb commented 9 years ago

I'm not sure I fully grasp all of these; we can discuss further tomorrow. I still think we are safe with #2 as is. I really do not think (we can confirm) that NCBI would use the same allele id for a protein-only allele and then change it to a nucleotide allele. I do believe they consider them different, even though they do not create separate protein alleles and associate them with the nucleotide alleles.

I really think it is going to be okay to declare ClinVar a repository of mutually exclusive nucleotide or amino acid alleles, and that identifying a protein allele by a ClinVar allele id that is defined at the nucleotide level is a mistake.

cbizon commented 9 years ago

Excerpt from an email with Donna:

For automated processing, ClinVar maps any assertion on a transcript sequence to the genome, looks at all transcripts aligned at that region, and re-translates any coding region in those transcripts.

For protein only assertions (usually OMIM or SCRP), we try to determine if there is only one nucleotide sequence that could generate that protein change. If so, we also represent as the nucleotide change. If not, we try to find time to check original literature (OMIM only) and find the nucleotide change that OMIM represented primarily as protein. Thus the significant number of records from OMIM not mapped to the genome.

So these are computationally generated, assuming basic molecular biology (although I think aware of selenocysteines)

My reading of this is that if there is a page in ClinVar that has both nucleotide and amino acid alleles, we can not assume that the original allele (or the intended allele, or the basic allele, or whatever) is the nucleotide allele.

cbizon commented 9 years ago

Based on Donna's email and further thoughts, here is the situation as I see it:

A ClinVar allele (Measure Set ID) refers to a group of simple alleles that include both nucleotide and amino acid alleles. To NCBI, these are all the same thing, so a single ID is applied to the set.

Identifying one of our canonical alleles with a clinvar allele will usually be incorrect for this reason. There is no primacy inherent in their grouping, as seen in Donna's email.

We should therefore not use ClinVar ids in the identifier. However, we do want to maintain the relation to ClinVar. I see two options:

1) Add a relatedAlleles element to CanonicalAllele. It would point to the ClinVar measure set, maybe with some kind of qualification.

2) Handle this with provenance elements (CanonicalAllele a -> derived from -> ClinVar MSID).
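
To make the two options concrete, here is a hedged Python sketch of how a CanonicalAllele could carry the ClinVar relation without claiming identity. The class and field names (`RelatedAllele`, `related_alleles`, `derived_from`, the `relation` qualifier and its value) are assumptions for illustration and are not taken from the model diagrams.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RelatedAllele:
    system: str
    value: str
    relation: str  # hypothetical qualifier describing how the two are related


@dataclass
class CanonicalAllele:
    id: str
    identifier: List[dict] = field(default_factory=list)  # identity claims only
    related_alleles: List[RelatedAllele] = field(default_factory=list)  # option 1
    derived_from: List[str] = field(default_factory=list)  # option 2 (provenance)


allele = CanonicalAllele(id="CA402")

# Option 1: point at the ClinVar measure set without asserting identity.
allele.related_alleles.append(
    RelatedAllele(
        system="http://www.ncbi.nlm.nih.gov/clinvar/variation",
        value="37656",
        relation="memberOfMeasureSet",  # hypothetical qualifier value
    )
)

# Option 2: record the ClinVar measure set as the source the allele was derived from.
allele.derived_from.append("http://www.ncbi.nlm.nih.gov/clinvar/variation/37656")
```

In both variants the `identifier` list stays empty of ClinVar ids, which is the point of the proposal.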

It is worth noting that the clinvar approach is the mainstream approach - so whatever solution we devise here will be used a lot with many different sources.

larrybabb commented 9 years ago

Updated model diagrams to include relatedIdentifier. I also took the liberty of adding a formal primary-key-like ID attribute on all Resources and corresponding Conceptual classes. In addition, I added a version attribute to both the CanonicalAllele and ReferenceSequence resource classes, since we will truly need to maintain multiple instances of the same conceptually identified CanonicalAllele and ReferenceSequence for both internal and external stable accessibility.