clingen-data-model / allele

Documentation for data model of ClinGen

Canonicalization issue #163

Closed ppawliczek closed 3 years ago

ppawliczek commented 8 years ago

We found an issue during canonicalization. It can arise when a deletion or insertion occurs next to a splice site. In that case, one canonical allele at the transcript level may correspond to two canonical alleles at the genome level. See the attached image for an example: [image: transcript_deletion]
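A minimal sketch of the situation described above, using a made-up toy sequence (the genome, exon coordinates, and helper names are all hypothetical and are not taken from the attached image or from any real gene). Exon 1 ends in A and exon 2 starts with A, so the transcript has AA across the junction. Deleting either A gives the same transcript, while the two genomic deletions are separated by the intron and remain distinct.

```python
# Toy example only: sequence and coordinates are invented for illustration.
GENOME = "CCGA" + "GTAAGTTTTCAG" + "ATGG"   # exon 1 + intron + exon 2
EXONS = [(0, 4), (16, 20)]                  # 0-based, half-open intervals


def splice(genome, exons):
    """Concatenate the exon intervals into a transcript sequence."""
    return "".join(genome[start:end] for start, end in exons)


def delete_base(genome, exons, pos):
    """Delete one genomic base and shift exon coordinates to match."""
    new_genome = genome[:pos] + genome[pos + 1:]
    new_exons = []
    for start, end in exons:
        new_start = start - 1 if start > pos else start
        new_end = end - 1 if end > pos else end
        new_exons.append((new_start, new_end))
    return new_genome, new_exons


# Genomic allele 1: delete the last base of exon 1 (position 3).
g1, e1 = delete_base(GENOME, EXONS, 3)
# Genomic allele 2: delete the first base of exon 2 (position 16).
g2, e2 = delete_base(GENOME, EXONS, 16)

t1 = splice(g1, e1)
t2 = splice(g2, e2)

print("genomic alleles differ:", g1 != g2)
print("transcript alleles identical:", t1 == t2)
```

Because the intron sits between the two deleted bases, no left- or right-shifting at the genome level can turn one deletion into the other, yet the spliced sequences are the same.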

cbizon commented 8 years ago

This is a really interesting example. When we made the first version of this model, we assumed that amino acid alleles were a different entity from genomic alleles, because multiple genomic alleles can lead to the same AA allele. We thought the same was not the case for genomic and transcript alleles. This is a good counter-example to that last point: there are two clearly distinct genomic alleles that are indistinguishable when viewed at the transcript level.

So what to do about it? I propose a short-term/long-term split, but I am far from sure that this is the correct answer:

Short term: The canonicalizer should treat the two transcript alleles as different, each canonicalizing to its own genomic allele but not to each other. This is clearly suboptimal, but I think it's the best we can do without the long-term solution.

Long term: Incorporate transcript/genome mappings into the allele model, and break out transcript alleles as a separate entity. In this view, canonicalizing will only be done within each layer, and the projections to other layers will be to separate entities.
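One way the long-term idea could look, as a rough sketch only: the class names, fields, and coordinates below are hypothetical and are not part of the current allele model. Each layer has its own canonical allele entity, and the transcript/genome mapping is stored as a projection between entities, so two genomic alleles can point at the same transcript allele without being merged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalAllele:
    """Hypothetical per-layer canonical allele (not the real model)."""
    layer: str       # "genomic" or "transcript"
    reference: str   # illustrative reference sequence name
    start: int       # 0-based start of the affected region
    end: int         # 0-based, exclusive end of the affected region
    alt: str         # replacement sequence ("" for a full deletion)


class ProjectionIndex:
    """Stores genomic -> transcript projections as links between entities."""

    def __init__(self):
        self._to_transcript = {}

    def add_projection(self, genomic, transcript):
        if genomic.layer != "genomic" or transcript.layer != "transcript":
            raise ValueError("projection must go from genomic to transcript")
        self._to_transcript[genomic] = transcript

    def transcript_for(self, genomic):
        return self._to_transcript[genomic]

    def genomic_sources(self, transcript):
        """All genomic alleles that project onto the given transcript allele."""
        return [g for g, t in self._to_transcript.items() if t == transcript]


# Two distinct genomic deletions on either side of an intron (toy coordinates).
del_exon1_end = CanonicalAllele("genomic", "toy_genome", 3, 4, "")
del_exon2_start = CanonicalAllele("genomic", "toy_genome", 16, 17, "")
# One transcript allele: AA at the exon junction reduced to A.
tx_del = CanonicalAllele("transcript", "toy_transcript", 3, 5, "A")

index = ProjectionIndex()
index.add_projection(del_exon1_end, tx_del)
index.add_projection(del_exon2_start, tx_del)

print(len(index.genomic_sources(tx_del)))
```

In this sketch the canonicalizer never compares alleles across layers; equality is only meaningful within a layer, and the many-to-one relationship lives in the projection index.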

larrybabb commented 8 years ago

I lean towards the long-term solution above. In practice, labs test and observe DNA, RNA, and AA sequences. I know we have tried to simplify that world by combining the RNA and DNA contextual alleles into a single canonical allele, but as we are now seeing, that is conceptually not correct.
The HL7 FHIR spec lets labs represent DNA, RNA, and AA sequences as different records.

larrybabb commented 8 years ago

@ppawliczek, do you have a reference to the actual ClinVar entry/submission that ended up highlighting this issue? It would be helpful when showing this use case to other groups.

rrfreimuth commented 8 years ago

For the record, I agree with the long term solution – we should canonicalize genomic, RNA, and AA sequences separately. In some cases it may be possible to assert relationships between the layers (i.e., when the sequences make it unambiguous to do so, although AA=>RNA/DNA may not be nearly as common).

We should also consider how we might canonicalize sequences when it might not be possible to determine exactly which base is inserted/deleted/duplicated, etc. (We’ve discussed that issue before…)

Thanks, Bob
