ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0

Implement support for transcript-based locations #199

Open reece opened 4 years ago

reece commented 4 years ago

Transcript-based coordinates depend on at least the exon structure. Coding positions also require the cds start and end.

Proposal: Implement TranscriptLocation, à la HGVS, by creating a class that stores coordinates on a specific transcript. The transcript reference should probably be a CURIE.

Ideally, the transcript will also depend on a computed digest for the transcript (as UTA does) to de-dupe based on reference sequence, exon structure, and cds start/end.
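As a rough illustration of the proposal above, here is a minimal Python sketch. It is not part of VRS or UTA: the class names, field names, and JSON serialization are all assumptions made for this example. The only borrowed piece is the GA4GH truncated digest (SHA-512, first 24 bytes, base64url-encoded), used to show how two transcripts with the same reference sequence, exon structure, and cds start/end could collapse to one identifier.

```python
import base64
import hashlib
import json
from dataclasses import dataclass
from typing import Optional, Tuple


def sha512t24u(blob: bytes) -> str:
    """GA4GH-style truncated digest: SHA-512, first 24 bytes, base64url."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript definition; field names are assumptions."""
    sequence_id: str                      # CURIE of the reference sequence
    exons: Tuple[Tuple[int, int], ...]    # (start, end) pairs on that sequence
    cds: Optional[Tuple[int, int]] = None # optional cds (start, end)

    def digest(self) -> str:
        # Assumed serialization: sorted exons, sorted keys, compact JSON.
        # This is NOT the VRS or UTA serialization, only a stand-in.
        payload = {
            "cds": list(self.cds) if self.cds else None,
            "exons": [list(e) for e in sorted(self.exons)],
            "sequence_id": self.sequence_id,
        }
        blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return sha512t24u(blob.encode("utf-8"))


@dataclass(frozen=True)
class TranscriptLocation:
    """Hypothetical location: coordinates on a specific transcript."""
    transcript_id: str  # CURIE or computed digest of the Transcript
    start: int
    end: int


# Made-up identifier and coordinates, for illustration only.
tx = Transcript("example:TX1", ((0, 100), (200, 300)), cds=(50, 250))
loc = TranscriptLocation(transcript_id=tx.digest(), start=10, end=20)
```

Because the digest covers only the sequence, exons, and cds, two records that describe the same structure under different accessions would yield the same `transcript_id` in this sketch.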

github-actions[bot] commented 3 years ago

This issue was marked stale due to inactivity.

rrfreimuth commented 3 years ago

During the VR call on October 5, several ideas related to this topic were reviewed. Slides were posted here. At the end of the call, feedback was requested. This very long comment is my attempt to capture some of my thinking on this topic.

Guiding principles:

I recognize those principles may appear academic to some. There is a necessary balance to be struck with the pragmatic need to move things forward.

Comments on proposed models

Model 1 (Reece):

Model 2 (Alex):

My thoughts

I like how Model 2 builds on Model 1. I find Model 2 to be more aligned with how I view things, but would also like to offer enhancements for consideration.

  1. Clarifying the use of Interval and Location

Take the following paraphrased definitions for these abstract classes:
    • Interval: a region specified by a counting system (e.g., coordinates counting residues) that is used to denote the start/end positions of the region. I don't think we have defined this abstract class in VRS.
    • Location: an Interval defined on a Sequence. VRS defines Location as a "position of a contiguous segment of a biological sequence."

I believe the counting system defined by the Interval must be consistent with the logical subunits of the underlying Sequence for instances of these classes to make sense. If we were to define MolecularEntity (a concept not yet defined by VRS) as the abstract base class for Sequence and Chromosome (also not yet defined by VRS), then we could express the following:

From the perspective of the abstract parent classes: MolecularEntity + Interval => Location

Following that pattern, we can then create: Sequence + SequenceInterval => SequenceLocation, and Chromosome + ChromosomeInterval => ChromosomeLocation

Note that Sequence is a linear arrangement of residues and SequenceInterval is a region defined by a residue-based coordinate system, resulting in a SequenceLocation where both components have the same frame of reference (residues).

Similarly, Chromosome (an object TBD in VRS) and ChromosomeInterval both have logical "subunits" of cytobands.

Sidebar: A Chromosome may be associated with a Sequence, and given a mapping between cytobands and coordinates a translation can therefore occur between SequenceInterval and ChromosomeInterval (likewise the equivalent *Location objects).
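The pattern "MolecularEntity + Interval => Location" can be sketched in Python as follows. This is only an illustration of the idea in this comment: MolecularEntity, Chromosome, and the Interval base class are not defined by VRS, and every class and field name here is an assumption. The point is that each Location subtype accepts only an entity and an interval that share a frame of reference.

```python
from dataclasses import dataclass
from typing import ClassVar, Type


@dataclass(frozen=True)
class MolecularEntity:
    id: str  # e.g. a CURIE; hypothetical field


@dataclass(frozen=True)
class Sequence(MolecularEntity):
    pass  # subunits: residues


@dataclass(frozen=True)
class Chromosome(MolecularEntity):
    pass  # subunits: cytobands


@dataclass(frozen=True)
class Interval:
    pass  # abstract: a region in some counting system


@dataclass(frozen=True)
class SequenceInterval(Interval):
    start: int  # residue coordinates
    end: int


@dataclass(frozen=True)
class ChromosomeInterval(Interval):
    start: str  # cytoband names
    end: str


@dataclass(frozen=True)
class Location:
    entity: MolecularEntity
    interval: Interval
    # Subclasses narrow these to enforce a shared frame of reference.
    entity_type: ClassVar[Type[MolecularEntity]] = MolecularEntity
    interval_type: ClassVar[Type[Interval]] = Interval

    def __post_init__(self) -> None:
        if not isinstance(self.entity, self.entity_type):
            raise TypeError(f"expected {self.entity_type.__name__}")
        if not isinstance(self.interval, self.interval_type):
            raise TypeError(f"expected {self.interval_type.__name__}")


@dataclass(frozen=True)
class SequenceLocation(Location):
    entity_type: ClassVar[Type[MolecularEntity]] = Sequence
    interval_type: ClassVar[Type[Interval]] = SequenceInterval


@dataclass(frozen=True)
class ChromosomeLocation(Location):
    entity_type: ClassVar[Type[MolecularEntity]] = Chromosome
    interval_type: ClassVar[Type[Interval]] = ChromosomeInterval


# Made-up identifiers and coordinates, for illustration only.
seq_loc = SequenceLocation(Sequence("example:seq1"), SequenceInterval(10, 20))
chr_loc = ChromosomeLocation(Chromosome("example:chr1"),
                             ChromosomeInterval("q22.2", "q22.3"))
```

In this sketch, pairing a Sequence with a ChromosomeInterval raises a TypeError, which is the consistency constraint argued for above. Any new Location subtype (such as a TranscriptLocation) would need to name both its entity and its interval type.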

  2. Defining FeatureInterval, TranscriptLocation, and AnnotatedSequenceLocation

The clarification of Interval and Location (above) is important when considering these new terms so that the base classes can be extended consistently.

Since Feature has not yet been defined, I propose the following working definition:

Feature: an element that has structural or functional significance on a Sequence (e.g., exon, intron, coding region, gene)

Given this definition, and the one above for Interval, then to create the concept of FeatureInterval we must determine the "counting system" for the subunits of a Feature. Sequences are composed of residues that can be indexed using a simple coordinate system. Chromosomes are composed of cytobands that are referenced using conceptual markers. What is the countable or indexable subunit of a Feature?

My struggle with TranscriptLocation is a little different. Assuming it inherits from the Location base class, that object would be composed of classes that specialize MolecularEntity and Interval. I might (still thinking on this) see how the Transcript class (as defined above) could specialize MolecularEntity (but NOT Sequence), but then we need to find a corresponding Interval. It might be tempting to base the "counting system" for TranscriptInterval on Exons, but then the Exon would be defined as the smallest subunit and could not be split further. Therefore, we could refer only to whole Exons, and the concept of "last half of Exon 4" would be nonsensical. This approach would be similar to, but not quite the same as, the subunit of Chromosome being the Cytoband (Exon and Cytoband are both examples of Features, but Cytoband has nomenclature that permits it to be subdivided further, whereas Exon does not).

I am uncertain of the difference between AnnotatedSequence (Model 2) and a Sequence with associated annotations (as might be constructed by the GKS Sequence Annotation subgroup). Furthermore, AnnotatedSequence would be the expected MolecularEntity of AnnotatedSequenceLocation, but the usage doesn't seem consistent and I'm not sure what the corresponding AnnotatedSequenceInterval would be (what is the logical subunit of an AnnotatedSequence?).

Summary (TL;DR)

reece commented 3 years ago

After 2 breaks and several national holidays, I finally worked my way through Bob's comment. :-)

Let's please drop discussion of imaginary sequences. It's incidental to the model. I will remove it from future discussions.

I care (very strongly) about one design goal: There should be only one notion of a collection of exons on a sequence. Just as Allele can be defined on any sequence, and just as one Allele might be used to derive another, a collection of exons on a specific sequence should have only one representation irrespective of sequence, and might be related to a similarly-structured set of exons on another sequence.

ahwagner commented 3 years ago

Thanks for sharing your thoughts @rrfreimuth. I'll put my responses here in the hope it will facilitate additional feedback.

I appreciate the call for semantic consistency, particularly as it relates to Locations. I would 👍 MolecularEntity, but the VR leads (myself, @larrybabb, and @reece for any newcomers to this org) previously debated and decided (with reservations) against the adoption of Chromosome as an entity, by virtue of not having stable versioned identifiers to point to for the concept. I think it will be useful to keep MolecularEntity in mind in case we come up with another entity that is parallel to Sequence (perhaps Transcript is this, as you suggest it might be... more on that below). Fortunately, the concept of Chromosome is transposable from our existing data structures, so MolecularEntity remains a useful way of thinking about Locations, even if it cannot formally encompass the chromosome representation in VRS.

I like the idea of Intervals being bound to the smallest subunit comprising a coordinate system. It is very VRS-like in my opinion, as this reduces duplication of concepts by disallowing the same coordinate range in a system to be expressed in multiple ways. I have previously considered the idea of a Full_Interval class as a shortcut, and ultimately decided against it for similar reasons.

As to Transcript being another MolecularEntity type, I agree that this starts to push the boundary a little. A Transcript (as we've defined it) is effectively a Sequence with exon and cds annotations on top of it. I have been putting aside my doubts about representing these as Value Objects, though I am concerned that we are painting ourselves into a corner. For example, why is cds important to capture as an intrinsic property of a transcript, but not structural motifs (which are described by some communities as context-independent, characteristic modules of RNA)? I think both models are taking a very domain-specific (i.e. human-mRNA-centric) approach to defining the biological concept of a transcript.

As to @reece's response to your comment, your point about imaginary intronic coordinates was similar to comments I made in a leads call earlier this week. To explain his comment about it being incidental a little further, the notion of offset-based Interval classes is not a firm requirement of his model, and we have agreed that both models can start with SequenceInterval classes and build from there as needed to minimize the complexity of the initial proposal.

There is an unresolved sticking point between the two models, however: can the Transcript class be either a contiguous set of exons/cds on an RNA sequence or the discontiguous set of template exons/cds on a genomic sequence? Or is the Transcript class only the exons/cds on an RNA sequence, and the other is a distinct concept? The discussion on this point has gone quite deep, and while I won't rehash it here, suffice it to say that we're still working it through.

However, I am beginning to think that our struggles to converge on a common model are deeper than that. I think we have been glossing over the issue that Transcripts, Genes, and other conceptually complex entities are too multifaceted to pin down as value objects in VRS. You're correct about AnnotatedSequence being a Sequence with associated annotations on top of it–that's definitely the concept. I had hoped that the AnnotatedSequence object could live in the VRS domain and provide an interface to SA objects, but I think that as a middle-ground solution it is universally disliked (or at least, I haven't heard any support for it). ¯\_(ツ)_/¯

However, I am beginning to see a path where we model these complex entities as Features in the Sequence Annotation domain, build Variations using SequenceLocations or ChromosomeLocations, and decorate these VRS objects with those Features to convey the necessary context for HGVS reconstruction or other use cases identified down the road. This approach would converge with conversations happening in VA regarding molecular consequence contexts.

The good news is that the VR leads have collectively identified that resolving this issue is non-blocking, and that we can advance in parallel on our efforts to model Abundance. This is probably a good thing, as I think many related components are falling into place at once across VA, SA, and VR. In the meantime, the leads will continue to hash this out until a consensus decision is reached.

github-actions[bot] commented 3 years ago

This issue was marked stale due to inactivity.

andreasprlic commented 3 years ago

If there is a transcript representation that knows about CDS, it would be great to be able to indicate ribosomal frameshifting sites (translational slippage). Here is an example of how NCBI currently represents this (https://www.ncbi.nlm.nih.gov/nuccore/NM_001301302.1):


```
CDS             join(234..326,328..801)
                     /gene="OAZ2"
                     /gene_synonym="AZ2"
                     /ribosomal_slippage
                     /note="protein translation is dependent on
                     polyamine-induced +1 ribosomal frameshift; isoform 2 is
                     encoded by transcript variant 2; ODC-Az 2"
                     /codon_start=1
                     /product="ornithine decarboxylase antizyme 2 isoform 2"
                     /protein_id="NP_001288231.1"
                     /db_xref="GeneID:4947"
                     /db_xref="HGNC:HGNC:8096"
                     /db_xref="MIM:604152"
                     /translation="MINTQDSILPLSNCPQLQCCRHIVPGPLWCSDAPHPLSKIPGGR
                     GGGRDPSLSALIYKDEKLTVTQDLPVNDGKPHIVHFQYEVTEVKVSSWDAVLSSQSLF
                     VEIPDGLLADGSKEGLLALLEFAEEKMKVNYVFICFRKGREDRAPLLKTFSFLGFEIV
                     RPGHPCVPSRPDVMFMVYPLDQNLSDED"
```

reece commented 3 years ago

Glad you raised this. How do you imagine that this would be used in practice? Would you report variation in the shifted or unshifted coordinates? Is the shifting a property of the transcript, or a property of the coordinates on the transcript?

If shifting is a property of the transcript, that would suggest an information model where a transcript consists of: a sequence, a set of exons on that sequence, an optional CDS <start,end>, and a new shift (signed int?). Is that what you had in mind?

andreasprlic commented 3 years ago

Ribosomal slippage happens in the middle of the mRNA, during translation. The way I would like to describe variants is relative to the protein-effect. To get this right, it will be important to be precise where exactly the variant is located. The variant could be either in the first part of a protein (in the unshifted region), or in the second part (the shifted region), or perhaps even overlap the site where the shift happens. (Note: I believe there are also examples with multiple shifts in one mRNA, not sure if they have been observed in human though)

I agree: if shifting is represented as a property of a transcript, the transcript would have a sequence, exons, and a list of CDS <start,end> pairs. I don't think it needs a signed int, since the CDS <start,end> positions would already encode the shift. The example above expresses this as join(234..326,328..801). As you can see, position 327 is part of an exon, but it is skipped and does not contribute to a codon.
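To make the "list of CDS <start,end>" idea concrete, here is a small Python sketch. The function names are hypothetical and the parser handles only the simple `join(a..b,c..d)` form shown in the example above. It assumes GenBank-style 1-based, inclusive coordinates and maps a transcript position to its offset within the spliced CDS, returning `None` for positions that are skipped.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based inclusive


def parse_join(location: str) -> List[Segment]:
    """Parse 'join(a..b,c..d)' or a plain 'a..b' into CDS segments."""
    text = location.strip()
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    segments = []
    for part in text.split(","):
        start, end = part.split("..")
        segments.append((int(start), int(end)))
    return segments


def cds_offset(segments: List[Segment], pos: int) -> Optional[int]:
    """Return the 0-based offset of transcript position `pos` within the
    concatenated CDS, or None if `pos` is not covered by any segment."""
    offset = 0
    for start, end in segments:
        if start <= pos <= end:
            return offset + (pos - start)
        offset += end - start + 1
    return None


def codon_index(segments: List[Segment], pos: int) -> Optional[int]:
    """Return the 0-based codon number containing `pos`, or None."""
    offset = cds_offset(segments, pos)
    return None if offset is None else offset // 3


segments = parse_join("join(234..326,328..801)")
cds_length = sum(end - start + 1 for start, end in segments)

print(segments)                    # [(234, 326), (328, 801)]
print(cds_length)                  # 567
print(cds_offset(segments, 327))   # None: the skipped base
print(codon_index(segments, 326))  # 30: last codon before the shift
print(codon_index(segments, 328))  # 31: first codon after the shift
```

With this representation, a variant can be classified as falling before the shift, after it, or on the skipped base without any separate signed offset, which matches the point that the segment list already carries the shift.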

github-actions[bot] commented 3 years ago

This issue was marked stale due to inactivity.

ahwagner commented 1 year ago

This should be revisited with the SA team with the upcoming gene and transcript models.