clingen-data-model / allele

Documentation for data model of ClinGen

IntronOffsetStart and IntronOffsetEnd with different directions #143

Closed ppawliczek closed 9 years ago

ppawliczek commented 9 years ago

It seems to me that HGVS allows ranges whose start and end lie in different introns (for coding DNA), e.g. c.100-5_200+4. The problem arises when the two intron offsets have different signs. How can we save something like that? In the SimpleAllele object there is one common intronOffsetDirection field for both the intronOffsetStart and intronOffsetEnd fields.
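To make the sign problem concrete, here is a minimal sketch that splits the coordinate part of such a range into its two ends. The helper names (`parse_end`, `parse_range`) are hypothetical and not part of the ClinGen model or any HGVS library. The sketch assumes plain positive c. positions only: no UTR positions such as c.-10 or c.*5, and no trailing edit such as `del`.

```python
import re

# Hypothetical helper for illustration only: one end of a c. range,
# e.g. "100-5" -> exon-anchored position 100, signed intron offset -5.
_END = re.compile(r"^(\d+)([+-]\d+)?$")

def parse_end(text):
    m = _END.match(text)
    if m is None:
        raise ValueError(f"unsupported coordinate: {text!r}")
    position = int(m.group(1))
    # A missing offset means the end lies on the transcript itself.
    offset = int(m.group(2)) if m.group(2) else 0
    return position, offset

def parse_range(hgvs_c):
    # Strip the "c." prefix, then split on the range separator "_".
    body = hgvs_c[2:] if hgvs_c.startswith("c.") else hgvs_c
    start_text, _, end_text = body.partition("_")
    # A single position is treated as a range of length one.
    return parse_end(start_text), parse_end(end_text or start_text)

start, end = parse_range("c.100-5_200+4")
print(start, end)  # (100, -5) (200, 4)
```

The start offset is negative and the end offset is positive, so a single shared intronOffsetDirection cannot describe both ends.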

ronakypatel commented 9 years ago

probably something like this representation?

NM_001035.2:c.169-199_273+819del

http://www.ncbi.nlm.nih.gov/clinvar/variation/29879/

cbizon commented 9 years ago

This is a good point. The model as currently constituted does not allow for this easily. Essentially, to fix this, we need to break up start/end so that each has its own intron offset.

cbizon commented 9 years ago

This is fixed in the conceptual model, but it still needs resources and docs.

larrybabb commented 9 years ago

I have attached two images (conceptual and resource) for what I am currently settled on. I know we sort of jammed something into the conceptual model as a team (added "start"/"end" associations to IntronOffset), but as I tried to carry it through the Resource side it became somewhat apparent that it needed further tweaking.

There's a part of me that thinks we should consider cDNA intronic representation to be out of scope for our model. HGVS intronic nomenclature is fundamentally flawed in that the supporting sequence is not included in the transcript sequence it is associated with.

But since that probably doesn't sit well with Chris B (at the very least), I would like you to look over the diagrams and let me know if they seem acceptable. (The conceptual one is a bit tricky.)

The essence of the change is that each end of the coordinate (start & end) needs to be broken out into a separate conceptual entity (i.e. Position). Each end can then optionally be associated with a genomic position plus an offset length and direction. This will allow either (or both) ends of the coordinate to fall in the intronic region.

BTW, I put the genomicReferenceSequence association in the ReferenceCoordinate class and assumed that if both start and end positions were in the intronic region, they would be based on the ReferenceCoordinate's single genomic reference sequence. I didn't want to think about having each intronic position based on a different genomic reference sequence (too much complexity).
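The split described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual conceptual or resource model: the class and field names (`Position`, `OffsetDirection`, `intron_offset`, `genomic_reference_sequence`) are hypothetical stand-ins for what the attached diagrams define, and the genomic sequence accession in the example is a made-up placeholder.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class OffsetDirection(Enum):
    # Hypothetical encoding of the HGVS offset sign.
    MINUS = "-"  # as in c.100-5
    PLUS = "+"   # as in c.200+4

@dataclass(frozen=True)
class Position:
    # Index on the transcript that anchors this end of the coordinate.
    index: int
    # Offset length and direction are set only when this end is intronic.
    intron_offset: Optional[int] = None
    intron_offset_direction: Optional[OffsetDirection] = None

    def is_intronic(self) -> bool:
        return self.intron_offset is not None

@dataclass(frozen=True)
class ReferenceCoordinate:
    start: Position
    end: Position
    # One genomic reference sequence shared by both ends (assumption
    # taken from the comment above: no per-end genomic sequences).
    genomic_reference_sequence: Optional[str] = None

    def spans_exon_and_intron(self) -> bool:
        # True when exactly one end carries an intron offset.
        return self.start.is_intronic() != self.end.is_intronic()

# c.100-5_200+4: both ends intronic, with opposite directions.
coord = ReferenceCoordinate(
    start=Position(100, 5, OffsetDirection.MINUS),
    end=Position(200, 4, OffsetDirection.PLUS),
    genomic_reference_sequence="NG_PLACEHOLDER",  # placeholder, not a real accession
)
print(coord.start.intron_offset_direction, coord.end.intron_offset_direction)
```

Because each Position carries its own direction, the two ends no longer have to agree, and a coordinate with only one intronic end is representable as well.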

The kicker is the impact on the primaryTranscriptRegionType. Since, I think, this is really trying to classify the entire coordinate's "region" (not the allele change), there isn't really just one primary type for coordinates that span an exon and an intron. So, please help me work this out.

All comments are welcome. If you think this should be in github, please feel free to transfer. Otherwise, I will summarize once we get some traction on a solution.

Please keep in mind: an inordinate amount of complexity is added to the model to support cDNA-based intronic alleleInstances. We need to really understand whether this is something we can simply bypass by stating "all allele instances must have reference coordinates based in a single reference sequence". Then we can determine if it makes sense to simply make the HGVS cDNA intronic expressions a special type of non-validated name. Not great options.

[Attached images: resourcealleleinstance, conceptualalleleinstance]

larrybabb commented 9 years ago

Here's an example of a transcript intronic variant that starts in the exon and ends in the intron:

http://www.ncbi.nlm.nih.gov/clinvar/variation/16148/