clingen-data-model / allele

Documentation for data model of ClinGen

genbank feature locations #135

Closed jdr0887 closed 3 years ago

jdr0887 commented 9 years ago

How are the feature locations modeled? Of particular interest are the complex expressions: join(complement(4918..5163),complement(2691..4571))

srynobio commented 9 years ago

Hi Jason,

I'm sorry, could you offer a more detailed explanation?

Thanks

jdr0887 commented 9 years ago

I can try....

The GenBank files from NCBI for RefSeq have a "features" section. Each feature has a "location". These locations can be expression-based (see http://www.insdc.org/files/feature_table.html#3.4.3). Within transcript ReferenceSequence instances, I can apply alignment/region information from these features to calculate introns & UTRs. Is it the intent of ClinGen to model alignments? If so, how are these complex regions modeled in ClinGen?
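To make the question concrete, here is a minimal Python sketch of what such a location expression breaks down into. It is an illustration only, not part of the ClinGen model: the names `Segment` and `parse_location` are hypothetical, and it handles only `join(...)`, `complement(...)`, plain `a..b` ranges, and single positions. Other INSDC operators (`order`, `<`/`>` partial markers, `^` sites, remote references) are deliberately rejected.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical flat representation of one contiguous span."""
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # +1 for forward, -1 for complement


_RANGE = re.compile(r"^(\d+)(?:\.\.(\d+))?$")


def _split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_location(expr: str) -> List[Segment]:
    """Flatten a (subset of a) location expression into ordered segments."""
    expr = expr.strip()
    if expr.startswith("join(") and expr.endswith(")"):
        segments: List[Segment] = []
        for part in _split_top_level(expr[len("join("):-1]):
            segments.extend(parse_location(part))
        return segments
    if expr.startswith("complement(") and expr.endswith(")"):
        inner = parse_location(expr[len("complement("):-1])
        # Complementing reverses both the strand and the reading order.
        return [Segment(s.start, s.end, -s.strand) for s in reversed(inner)]
    match = _RANGE.match(expr)
    if match is None:
        raise ValueError(f"unsupported location expression: {expr!r}")
    start = int(match.group(1))
    end = int(match.group(2) or match.group(1))
    return [Segment(start, end, 1)]


segments = parse_location("join(complement(4918..5163),complement(2691..4571))")
```

For the expression from the original question, `segments` holds two reverse-strand spans in the order written: 4918..5163 followed by 2691..4571. Any region model that wants to represent such features needs at least an ordered list of (start, end, strand) spans rather than a single start/stop pair.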

larrybabb commented 9 years ago

It is not the intent to model the alignments. We discussed this early on and determined that our initial modeling scope would not include this level of detail. @cbizon may have some additional thoughts on the topic.

cbizon commented 9 years ago

I agree that the model does not include alignments (though I increasingly think that may be something we want to revisit in a future iteration).

However, regions can be used for more than just alignments: for instance, the CDS can be written as a region on a transcript, and there are a bunch of other regions that somebody might want to talk about on a transcript, like the features that Jason mentions above.

In fact, one thing that we might find disturbing for the current allele model is that the CDS is sometimes not a simple region in RefSeq, as in this example: http://www.ncbi.nlm.nih.gov/nuccore/NM_004152.1 When that's the case, a simple start/stop for the CDS is insufficient for going back and forth between transcript and coding coordinates.
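A small Python sketch may help show why a single start/stop breaks down. It assumes the CDS is given as an ordered list of forward-strand segments on the transcript. The function names and the segment coordinates are hypothetical, chosen only for illustration, and are not taken from NM_004152.1 or from the allele model.

```python
from typing import List, Optional, Tuple

# (start, end) on the transcript, 1-based inclusive, listed in CDS order.
Segment = Tuple[int, int]


def transcript_to_coding(pos: int, cds: List[Segment]) -> Optional[int]:
    """Return the 1-based coding position for a transcript position,
    or None if the position is not part of the CDS."""
    offset = 0  # coding bases contributed by earlier segments
    for start, end in cds:
        if start <= pos <= end:
            return offset + (pos - start) + 1
        offset += end - start + 1
    return None


def coding_to_transcript(cpos: int, cds: List[Segment]) -> Optional[int]:
    """Return the 1-based transcript position for a coding position,
    or None if the coding position is outside the CDS."""
    remaining = cpos
    for start, end in cds:
        length = end - start + 1
        if 1 <= remaining <= length:
            return start + remaining - 1
        remaining -= length
    return None


# Hypothetical two-part CDS that skips transcript base 201.
cds = [(101, 200), (202, 400)]

naive = 202 - 101 + 1                      # single start/stop arithmetic
segmented = transcript_to_coding(202, cds)  # accounts for the skipped base
```

With these made-up segments, the single start/stop arithmetic places transcript position 202 at coding position 102, while the segment-aware mapping places it at 101; every position after the gap is off by one in the naive version, and transcript position 201 has no coding coordinate at all.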

If we include regions to fix that problem, then it probably makes sense to reuse those regions for other purposes.