clingen-data-model / allele

Documentation for data model of ClinGen

Inconsistency in definition of Contextual Allele #161

Closed ppawliczek closed 3 years ago

ppawliczek commented 8 years ago

Hi everyone, I have run into some issues; any help would be appreciated. Let me know if something is unclear.

I think that there is some inconsistency in the Contextual Allele definition. Let's consider the following example:

reference sequence: ACGTCCGTATGGC
alternate sequence: ACGTCATTATGGC

So we have here the following indel in range 5-7: CG->AT

The following allele A corresponding to this indel is registered: 5-7,AT (coded as region and a sequence).

What about registering the following alleles B and C?
B: 4-7,CAT
C: 5-8,ATT

Is A = B = C? In my mind they should all be canonicalized as allele A.
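To make the question concrete, here is a minimal sketch that applies A, B and C to the reference. It assumes the ranges are 0-based, half-open positions (the reading under which 5-7 covers "CG" in the example); `apply_allele` is a hypothetical helper for illustration, not part of the ClinGen model.

```python
# Assumption: ranges such as 5-7 are 0-based, half-open coordinates,
# so reference[5:7] == "CG" as in the example above.
REFERENCE = "ACGTCCGTATGGC"


def apply_allele(reference: str, start: int, end: int, replacement: str) -> str:
    """Replace reference[start:end] with `replacement` and return the full sequence."""
    return reference[:start] + replacement + reference[end:]


# Alleles written as (start, end, replacement sequence).
alleles = {
    "A": (5, 7, "AT"),
    "B": (4, 7, "CAT"),
    "C": (5, 8, "ATT"),
}

results = {name: apply_allele(REFERENCE, *spec) for name, spec in alleles.items()}
for name, sequence in results.items():
    print(name, sequence)
```

All three definitions produce ACGTCATTATGGC, the alternate sequence from the example, even though their boundaries differ.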

We have allele A. Let's assume that we want to define allele D, which corresponds to the original sequence that A replaces (D = ~A). How to do that?
D: 5-7,CG
Let's consider alleles E, F and G:
E: 4-7,CCG
F: 5-8,CGT
G: 7-9,TA

Is D = E = F? Is D = E = F = G?

Alleles D, E, F and G do not change the reference sequence, so we do not know how to canonicalize them. Without the definition of the alternate allele A, there is no clue where allele D begins and ends.
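The point can be illustrated with a simple trimming rule. This is only a sketch under assumptions: 0-based, half-open coordinates, and a hypothetical `trim` helper that strips bases shared by the reference span and the replacement, prefix first and then suffix. It is not the canonicalization used by any registry.

```python
# Assumption: 0-based, half-open coordinates, as in reference[5:7] == "CG".
REFERENCE = "ACGTCCGTATGGC"


def trim(reference: str, start: int, end: int, replacement: str):
    """Strip bases shared by reference[start:end] and `replacement` from both ends."""
    ref_span = reference[start:end]
    # Drop the common prefix, moving the start boundary to the right.
    while ref_span and replacement and ref_span[0] == replacement[0]:
        ref_span, replacement, start = ref_span[1:], replacement[1:], start + 1
    # Drop the common suffix, moving the end boundary to the left.
    while ref_span and replacement and ref_span[-1] == replacement[-1]:
        ref_span, replacement, end = ref_span[:-1], replacement[:-1], end - 1
    return start, end, replacement


changing = {"A": (5, 7, "AT"), "B": (4, 7, "CAT"), "C": (5, 8, "ATT")}
identity = {"D": (5, 7, "CG"), "E": (4, 7, "CCG"), "F": (5, 8, "CGT"), "G": (7, 9, "TA")}

trimmed_changing = {name: trim(REFERENCE, *spec) for name, spec in changing.items()}
trimmed_identity = {name: trim(REFERENCE, *spec) for name, spec in identity.items()}

for name, spec in {**trimmed_changing, **trimmed_identity}.items():
    print(name, spec)
```

Under this rule A, B and C all reduce to (5, 7, "AT"), but D, E, F and G each shrink to an empty span with an empty replacement, so the boundaries the user supplied are lost and nothing is left to canonicalize.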

Summary: I think that the following two requirements cannot be fulfilled at the same time:

  1. User can register any allele, defined as region & new sequence. (boundaries are taken from user)
  2. Allele definitions which result in the same sequence are grouped together as single canonical allele. (boundaries are calculated)

ad. 1) In this case A != B, A != C & B != C (the situation is analogous for D, E, F, G). But there is no room for real canonicalization.

ad. 2) In this case A = B = C. But D, E, F and G cannot be defined as alleles because we cannot calculate their coordinates.

srynobio commented 8 years ago

Let me try to understand based on your example:

A:
R ACGTC[CG]TATGGC
A ACGTC[AT]TATGGC

B:
R ACGT[CCG]TATGGC
A ACGT[CAT]TATGGC

C:
R ACGTC[CGT]ATGGC
A ACGTC[ATT]ATGGC

I would consider these three different alleles so A ≠ B ≠ C

Also, this would violate what an allele is: "Alleles D,E,F,G do not change the reference sequence, so we do not know how to canonicalize them"

Ultimately, what you're canonicalizing is different contextual alleles that represent the same change with regard to the same reference.

ppawliczek commented 8 years ago

So A, B and C are not "different contextual alleles which represent the same change in regards to the same reference" ?

ppawliczek commented 8 years ago

The problem is that we use Contextual Allele in two different contexts:
ad. 1) As a marker of any subsequence of the reference
ad. 2) As a definition of a possible change (variation) in the reference

srynobio commented 8 years ago

If you review the introduction on the allele model page:

The purpose of the Allele Data Model is to provide a referenceable entity that represents the choice of particular allele at a site of genetic variation. This entity should be resilient to updates in reference sequences, while containing sufficient information to uniquely specify the allele.

ppawliczek commented 8 years ago

According to this definition we can define only alleles A and D (so A=B=C). The problem is that in this case we cannot define D without information about A. Or maybe D is a part of the definition of A?