Implement support for transcript-based locations

reece commented 4 years ago

Coordinates based on transcripts depend, at a minimum, on the exon structure. Coding positions additionally require the CDS start and end.

Proposal: Implement TranscriptLocation, à la HGVS, by creating a class that stores coordinates on a specific transcript. The transcript reference should probably be a CURIE.

Ideally, the transcript will also depend on a computed digest for the transcript (as UTA does) to de-dupe based on reference sequence, exon structure, and CDS start/end.
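
A minimal sketch of the digest idea, for illustration only. The `Transcript` class, its field names, and the JSON serialization are assumptions made here, not UTA's actual scheme or a VRS schema. The digest function is the truncated SHA-512 (first 24 bytes, base64url-encoded) commonly called sha512t24u.

```python
import base64
import hashlib
import json
from dataclasses import dataclass
from typing import Optional, Tuple


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest truncated to 24 bytes, then base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


@dataclass(frozen=True)
class Transcript:
    # Hypothetical fields chosen for this sketch.
    sequence_id: str                        # CURIE for the reference sequence
    exons: Tuple[Tuple[int, int], ...]      # (start, end) pairs on that sequence
    cds: Optional[Tuple[int, int]] = None   # (start, end); None if non-coding

    def digest(self) -> str:
        # Serialize deterministically so equal content yields an equal digest.
        payload = json.dumps(
            {"sequence_id": self.sequence_id, "exons": self.exons, "cds": self.cds},
            sort_keys=True,
            separators=(",", ":"),
        )
        return sha512t24u(payload.encode("utf-8"))
```

Two transcripts with the same reference sequence, exon structure, and CDS bounds produce the same digest, and any difference in those three inputs changes it, which is the de-duplication property described above.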

github-actions[bot] commented 3 years ago

This issue was marked stale due to inactivity.

rrfreimuth commented 3 years ago

During the VR call on October 5, several ideas related to this topic were reviewed. Slides were posted here. At the end of the call, feedback was requested. This very long comment is my attempt to capture some of my thinking on this topic.

Guiding principles:

I believe it is of paramount importance to clearly define the entities that we model. The VMC effort spent considerable time establishing common vocabulary, which has paid dividends as this work continues.
I believe it is important to maintain rigor in our semantics and strive to eliminate ambiguity, building on the success of our foundational work on the VRS.
I believe it is essential for our model to be unerringly internally coherent (semantically) so that it is predictably extensible, an important capability for us to maintain as the VRS expands to support additional types of variation and use cases.

I recognize those principles may appear academic to some. There is a necessary balance to be struck with the pragmatic need to move things forward.

Comments on proposed models

Model 1 (Reece):

Scope: I agree that VRS should support the definition of and reference to arbitrary intervals on genomic sequence, transcripts, and protein sequences.
I think it is important to be semantically precise, so assertions or annotations should be modeled as appropriate. This may mean some entities fall to the VA or SA subgroups.
I believe imaginary locations (i.e., those that do not exist on the subject sequence) are nonsensical and non-computable, but that does not mean we cannot support them. I believe they can be modeled as a more complex structure, if needed.
Based on the VRS definition of Sequence and the proposed model, a Transcript is NOT a Sequence (just making this explicit). Instead, a Transcript is a complex object that HAS a Sequence and at least two types of defined Features (exons and coding sequence/CDS). I think I am comfortable with the working definition as presented, but we should add some constraints (e.g., Exons must be adjoining, non-overlapping, and contiguous over the entire extent of the Transcript) and consider other attributes (e.g., genetic locus).
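
To make the proposed exon constraint concrete, here is a rough sketch of a validator. It assumes exons are given in transcript order as 0-based, half-open (start, end) pairs on the transcript's own sequence; the function name and signature are hypothetical.

```python
from typing import List, Tuple


def validate_exons(exons: List[Tuple[int, int]], transcript_length: int) -> None:
    """Raise ValueError unless exons tile [0, transcript_length) exactly."""
    if not exons:
        raise ValueError("a transcript needs at least one exon")
    expected_start = 0
    for number, (start, end) in enumerate(exons, start=1):
        if end <= start:
            raise ValueError(f"exon {number} is empty or reversed")
        if start < expected_start:
            raise ValueError(f"exon {number} overlaps the previous exon")
        if start > expected_start:
            raise ValueError(f"gap before exon {number}")
        expected_start = end
    if expected_start != transcript_length:
        raise ValueError("exons do not cover the full extent of the transcript")
```

For example, `validate_exons([(0, 100), (100, 250)], 250)` passes, while `[(0, 100), (90, 250)]` (overlap) and `[(0, 100), (110, 250)]` (gap) are both rejected.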

Model 2 (Alex):

Good example illustrating HGVS recommendations that require the specification of a genomic reference sequence when an intronic coordinate is used. I agree that the coordinate space refers to the genomic sequence and it is simply expressed relative to the transcript's sequence.
Transcript model is the same as in Model 1 (this is a Good Thing). I believe SequenceInterval might be better used as the datatype for Exons and CDS, since we want to specify a coordinate-based counting system.
I like the approach proposed in the conceptual model, but I am not yet sold on the name FeatureInterval. I think we need to work on the base classes a bit (see below). The conceptual distinction between the Sequence, AnnotatedSequence, and Transcript is helpful.
The examples are very clear, but I am not sure about the names TranscriptLocation and AnnotatedSequenceLocation. I think they are on the right track but I'm not yet convinced they are completely aligned with the base classes.
I am particularly excited about the flexibility illustrated by scenarios 1-3, and the non-imaginary nature of scenario 3.

My thoughts

I like how Model 2 builds on Model 1. I find Model 2 to be more aligned with how I view things, but would also like to offer enhancements for consideration.

Clarifying the use of Interval and Location

Take the following paraphrased definitions for these abstract classes:
- Interval: a region for which a counting system is defined (e.g., coordinates counting residues) that is used to denote the start/end positions of the region. I don't think we have defined this abstract class in VRS.
- Location: an Interval defined on a Sequence. The VRS defines Location as a "position of a contiguous segment of a biological sequence."

I believe the counting system defined by the Interval must be consistent with the logical subunits of the underlying Sequence for instances of these classes to make sense. If we were to define MolecularEntity (a concept not yet defined by VRS) as the abstract base class for Sequence and Chromosome (also not yet defined by VRS), then we could express the following:

From the perspective of the abstract parent classes: MolecularEntity + Interval => Location

Following that pattern, we can then create: Sequence + SequenceInterval => SequenceLocation, and Chromosome + ChromosomeInterval => ChromosomeLocation

Note that Sequence is a linear arrangement of residues and SequenceInterval is a region defined by a residue-based coordinate system, resulting in a SequenceLocation where both components have the same frame of reference (residues).

Similarly, Chromosome (an object TBD in VRS) and ChromosomeInterval both have logical "subunits" of cytobands.

Sidebar: A Chromosome may be associated with a Sequence, and given a mapping between cytobands and coordinates a translation can therefore occur between SequenceInterval and ChromosomeInterval (likewise the equivalent *Location objects).
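
The pattern MolecularEntity + Interval => Location could be sketched roughly as follows. This is an illustration of the proposal only: MolecularEntity, Chromosome, and ChromosomeInterval are not defined by VRS, and the field names are assumptions.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar


class MolecularEntity:
    """Abstract base: something composed of countable subunits."""


class Interval:
    """Abstract base: a region expressed in some counting system."""


@dataclass(frozen=True)
class Sequence(MolecularEntity):
    id: str          # subunit: residue


@dataclass(frozen=True)
class Chromosome(MolecularEntity):
    name: str        # subunit: cytoband


@dataclass(frozen=True)
class SequenceInterval(Interval):
    start: int       # residue coordinates
    end: int


@dataclass(frozen=True)
class ChromosomeInterval(Interval):
    start: str       # cytoband names, e.g. "q22.1"
    end: str


E = TypeVar("E", bound=MolecularEntity)
I = TypeVar("I", bound=Interval)


@dataclass(frozen=True)
class Location(Generic[E, I]):
    """An Interval placed on a MolecularEntity with matching subunits."""
    entity: E
    interval: I


class SequenceLocation(Location[Sequence, SequenceInterval]):
    pass


class ChromosomeLocation(Location[Chromosome, ChromosomeInterval]):
    pass
```

The type parameters only document the pairing for a static type checker; nothing here stops a mismatched entity and interval at runtime. The open question for TranscriptLocation then becomes which `Interval` subclass would pair with a `Transcript` entity.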

Defining FeatureInterval, TranscriptLocation, and AnnotatedSequenceLocation

The clarification of Interval and Location (above) is important when considering these new terms so that the base classes can be extended consistently.

Since Feature has not yet been defined, I propose the following working definition: Feature: an element that has structural or functional significance on a Sequence (e.g., exon, intron, coding region, gene)

Given this definition, and the one above for Interval, then to create the concept of FeatureInterval we must determine the "counting system" for the subunits of a Feature. Sequences are composed of residues that can be indexed using a simple coordinate system. Chromosomes are composed of cytobands that are referenced using conceptual markers. What is the countable or indexable subunit of a Feature?

My struggle with TranscriptLocation is a little different. Assuming it inherits from the Location base class, that object would be composed of classes that specialize MolecularEntity and Interval. I might (still thinking on this) see how the Transcript class (as defined above) could specialize MolecularEntity (but NOT Sequence), but then we need to find a corresponding Interval. It might be tempting to base the "counting system" for TranscriptInterval on Exons, but then the Exon would be defined as the smallest subunit and could not be split further. Therefore, we could refer to only whole Exons and the concept of "last half of Exon 4" would be nonsensical. This approach would be similar to, but not quite the same as, the subunit of Chromosome being the Cytoband (Exon and Cytoband are both examples of Features, but Cytoband has nomenclature that permits it to be subdivided further, whereas Exon does not).

I am uncertain of the difference between AnnotatedSequence (Model 2) and a Sequence with associated annotations (as might be constructed by the GKS Sequence Annotation subgroup). Furthermore, AnnotatedSequence would be the expected MolecularEntity of AnnotatedSequenceLocation, but the usage doesn't seem consistent and I'm not sure what the corresponding AnnotatedSequenceInterval would be (what is the logical subunit of an AnnotatedSequence?).

Summary (TL;DR)

Both models were helpful in getting this conversation started
I prefer Model 2 as a starting point -- I like the capabilities of expressing intronic locations using a linked genomic reference -- I think we need to continue to work on the semantics of the composing classes (might simply need adjustments to the names)
I like weedy semantics

reece commented 3 years ago

After 2 breaks and several national holidays, I finally worked my way through Bob's comment. :-)

Let's please drop discussion of imaginary sequences. It's incidental to the model. I will remove it from future discussions.

I care (very strongly) about one design goal: There should be only one notion of a collection of exons on a sequence. Just as Allele can be defined on any sequence, and just as one Allele might be used to derive another, a collection of exons on a specific sequence should have only one representation irrespective of sequence, and might be related to a similarly-structured set of exons on another sequence.

ahwagner commented 3 years ago

Thanks for sharing your thoughts @rrfreimuth. I'll put my responses here in the hope it will facilitate additional feedback.

I appreciate the call for semantic consistency, particularly as it relates to Locations. I would 👍 MolecularEntity, but the VR leads (myself, @larrybabb, and @reece for any newcomers to this org) previously debated and decided (with reservations) against the adoption of Chromosome as an entity, by virtue of not having stable versioned identifiers to point to for the concept. I think it will be useful to keep MolecularEntity in mind in case we come up with another entity that is parallel to Sequence (perhaps Transcript is this, as you suggest it might be... more on that below). Fortunately, the concept of Chromosome is transposable from our existing data structures, so MolecularEntity remains a useful way of thinking about Locations, even if it cannot formally encompass the chromosome representation in VRS.

I like the idea of Intervals being bound to the smallest subunit comprising a coordinate system. It is very VRS-like in my opinion, as this reduces duplication of concepts by disallowing the same coordinate range in a system to be expressed in multiple ways. I have previously considered the idea of a Full_Interval class as a shortcut, and ultimately decided against it for similar reasons.

As to Transcript being another MolecularEntity type, I agree that this starts to push the boundary a little. A Transcript (as we've defined it) is effectively a Sequence with exon and cds annotations on top of it. I have been putting aside my doubts about representing these as Value Objects, though I am concerned that we are painting ourselves into a corner. For example, why is cds important to capture as an intrinsic property of a transcript, but not structural motifs (which are described by some communities as context-independent, characteristic modules of RNA)? I think both models are taking a very domain-specific (i.e. human-mRNA-centric) approach to defining the biological concept of a transcript.

As to @reece's response to your comment, your point about imaginary intronic coordinates was similar to comments I made in a leads call earlier this week. To expand on his remark that it is incidental: the notion of offset-based Interval classes is not a firm requirement of his model, and we have agreed that both models can start with SequenceInterval classes and build from there as needed to minimize the complexity of the initial proposal.

There is an unresolved sticking point between the two models, however: can the Transcript class be either a contiguous set of exons/cds on an RNA sequence or the discontiguous set of template exons/cds on a genomic sequence? Or is the Transcript class only the exons/cds on an RNA sequence, and the other is a distinct concept? The discussion on this point has gone quite deep, and while I won't rehash it here, suffice it to say that we're still working it through.

However, I am beginning to think that our struggles to converge on a common model are deeper than that. I think we have been glossing over the issue that Transcripts, Genes, and other conceptually complex entities are too multifaceted to pin down as value objects in VRS. You're correct about AnnotatedSequence being a Sequence with associated annotations on top of it–that's definitely the concept. I had hoped that the AnnotatedSequence object could live in the VRS domain and provide an interface to SA objects, but I think that as a middle-ground solution it is universally disliked (or at least, I haven't heard any support for it). ¯\_(ツ)_/¯

However, I am beginning to see a path where we model these complex entities as Features in the Sequence Annotation domain, build Variations using SequenceLocations or ChromosomeLocations, and decorate these VRS objects with those Features to convey the necessary context for HGVS reconstruction or other use cases identified down the road. This approach would converge with conversations happening in VA regarding molecular consequence contexts.

The good news is that the VR leads have collectively identified that resolving this issue is non-blocking, and that we can advance in parallel on our efforts to model Abundance. This is probably a good thing, as I think many related components are falling into place at once across VA, SA, and VR. In the meantime, the leads will continue to hash this out until a consensus decision is reached.

github-actions[bot] commented 3 years ago

This issue was marked stale due to inactivity.

andreasprlic commented 3 years ago

If there is a transcript representation that knows about CDS, it would be great to be able to indicate ribosomal frameshifting sites (translational slippage). Here is an example of how NCBI currently represents this (https://www.ncbi.nlm.nih.gov/nuccore/NM_001301302.1):


```
CDS             join(234..326,328..801)
                     /gene="OAZ2"
                     /gene_synonym="AZ2"
                     /ribosomal_slippage
                     /note="protein translation is dependent on
                     polyamine-induced +1 ribosomal frameshift; isoform 2 is
                     encoded by transcript variant 2; ODC-Az 2"
                     /codon_start=1
                     /product="ornithine decarboxylase antizyme 2 isoform 2"
                     /protein_id="NP_001288231.1"
                     /db_xref="GeneID:4947"
                     /db_xref="HGNC:HGNC:8096"
                     /db_xref="MIM:604152"
                     /translation="MINTQDSILPLSNCPQLQCCRHIVPGPLWCSDAPHPLSKIPGGR
                     GGGRDPSLSALIYKDEKLTVTQDLPVNDGKPHIVHFQYEVTEVKVSSWDAVLSSQSLF
                     VEIPDGLLADGSKEGLLALLEFAEEKMKVNYVFICFRKGREDRAPLLKTFSFLGFEIV
                     RPGHPCVPSRPDVMFMVYPLDQNLSDED"
```

reece commented 3 years ago

Glad you raised this. How do you imagine that this would be used in practice? Would you report variation in the shifted or unshifted coordinates? Is the shifting a property of the transcript, or a property of the coordinates on the transcript?

If shifting is a property of the transcript, that would suggest an information model where a transcript consists of: a sequence, a set of exons on that sequence, an optional CDS <start,end>, and a new shift (signed int?). Is that what you had in mind?

andreasprlic commented 3 years ago

Ribosomal slippage happens in the middle of the mRNA, during translation. The way I would like to describe variants is relative to the protein-effect. To get this right, it will be important to be precise where exactly the variant is located. The variant could be either in the first part of a protein (in the unshifted region), or in the second part (the shifted region), or perhaps even overlap the site where the shift happens. (Note: I believe there are also examples with multiple shifts in one mRNA, not sure if they have been observed in human though)

I agree: if shifting is represented as a property of a transcript, the transcript would have a sequence, exons, and a list of CDS <start,end> segments. I don't think it needs a signed int, since the CDS <start,end> positions would already encode the shift. The example above expresses this as join(234..326,328..801). As you can see, position 327 is part of an exon, but it is skipped and does not end up contributing to a codon.
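
A small sketch of how a list of CDS segments captures the shift without a separate signed int. It assumes the GenBank convention of 1-based, inclusive coordinates; `parse_join` and `cds_offset` are hypothetical helper names, and only simple `a..b` segments are handled.

```python
import re
from typing import List, Optional, Tuple


def parse_join(location: str) -> List[Tuple[int, int]]:
    """Parse 'join(a..b,c..d)' into 1-based inclusive (start, end) pairs."""
    return [(int(s), int(e)) for s, e in re.findall(r"(\d+)\.\.(\d+)", location)]


def cds_offset(segments: List[Tuple[int, int]], pos: int) -> Optional[int]:
    """0-based offset of transcript position `pos` in the concatenated CDS.

    Returns None when `pos` is outside every segment (e.g. a skipped base).
    """
    consumed = 0
    for start, end in segments:
        if start <= pos <= end:
            return consumed + (pos - start)
        consumed += end - start + 1
    return None


segments = parse_join("join(234..326,328..801)")
cds_length = sum(end - start + 1 for start, end in segments)

print(segments)                   # [(234, 326), (328, 801)]
print(cds_length)                 # 567, i.e. 189 codons including the stop
print(cds_offset(segments, 326))  # 92   -> last base of codon index 30
print(cds_offset(segments, 327))  # None -> skipped, contributes to no codon
print(cds_offset(segments, 328))  # 93   -> first base of codon index 31
```

Dividing the offset by 3 gives the codon index, so a variant can be placed in the unshifted region, the shifted region, or flagged as overlapping the slippage site.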

github-actions[bot] commented 3 years ago

This issue was marked stale due to inactivity.

ahwagner commented 1 year ago

This should be revisited with the SA team with the upcoming gene and transcript models.

ga4gh / vrs

Implement support for transcript-based locations #199