Open reece opened 4 years ago
This issue was marked stale due to inactivity.
During the VR call on October 5, several ideas related to this topic were reviewed. Slides were posted here. At the end of the call, feedback was requested. This very long comment is my attempt to capture some of my thinking on this topic.
Guiding principles:
I recognize those principles may appear academic to some. There is a necessary balance to be struck with the pragmatic need to move things forward.
Comments on proposed models
Model 1 (Reece):
`Sequence` and the proposed model, a `Transcript` is NOT a `Sequence` (just making this explicit). Instead, a `Transcript` is a complex object that HAS a `Sequence` and at least two types of defined Features (exons and coding sequence/CDS). I think I am comfortable with the working definition as presented, but we should add some constraints (e.g., Exons must be adjoining, non-overlapping, and contiguous over the entire extent of the Transcript) and consider other attributes (e.g., genetic locus).
Model 2 (Alex):
`SequenceInterval` might be better used as the datatype for Exons and CDS, since we want to specify a coordinate-based counting system.
`FeatureInterval`. I think we need to work on the base classes a bit (see below). The conceptual distinction between the Sequence, AnnotatedSequence, and Transcript is helpful.
My thoughts
I like how Model 2 builds on Model 1. I find Model 2 to be more aligned with how I view things, but would also like to offer enhancements for consideration.
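The constraints suggested for Model 1 (exons adjoining, non-overlapping, and contiguous over the whole transcript) could be checked along these lines. This is only a sketch: the `Transcript` fields, the interbase (0-based, half-open) coordinates, and the `exon_constraint_errors` helper are assumptions made for illustration, not part of either proposed model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Transcript:
    """Hypothetical shape: a Transcript HAS a Sequence plus exon and CDS features."""
    sequence_id: str                       # reference to the underlying Sequence
    sequence_length: int                   # length of that Sequence in residues
    exons: List[Tuple[int, int]]           # interbase (start, end) pairs, in order
    cds: Optional[Tuple[int, int]] = None  # interbase (start, end), if coding


def exon_constraint_errors(tx: Transcript) -> List[str]:
    """Return the violated constraints (an empty list means the layout is valid)."""
    if not tx.exons:
        return ["transcript has no exons"]
    errors = []
    if tx.exons[0][0] != 0:
        errors.append("first exon does not start at the beginning of the sequence")
    if tx.exons[-1][1] != tx.sequence_length:
        errors.append("last exon does not end at the end of the sequence")
    for start, end in tx.exons:
        if start >= end:
            errors.append(f"exon ({start}, {end}) is empty or reversed")
    # Adjacent exons must share a boundary: no overlap and no gap.
    for (_, prev_end), (next_start, _) in zip(tx.exons, tx.exons[1:]):
        if next_start < prev_end:
            errors.append(f"overlap at {next_start}..{prev_end}")
        elif next_start > prev_end:
            errors.append(f"gap at {prev_end}..{next_start}")
    return errors


# Made-up coordinates, for illustration only.
ok = Transcript("seq:example", 450, [(0, 120), (120, 300), (300, 450)], cds=(50, 400))
bad = Transcript("seq:example", 450, [(0, 120), (130, 300), (300, 450)])
print(exon_constraint_errors(ok))
print(exon_constraint_errors(bad))
```

With interbase coordinates, "adjoining" reduces to each exon starting exactly where the previous one ends, which is why a single pairwise pass covers both the gap and the overlap case.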
`Interval` and `Location`
Take the following paraphrased definitions for these abstract classes:
I believe the counting system defined by the `Interval` must be consistent with the logical subunits of the underlying `Sequence` for instances of these classes to make sense. If we were to define `MolecularEntity` (a concept not yet defined by VRS) as the abstract base class for `Sequence` and `Chromosome` (also not yet defined by VRS), then we could express the following:
From the perspective of the abstract parent classes:
`MolecularEntity` + `Interval` => `Location`
Following that pattern, we can then create:
`Sequence` + `SequenceInterval` => `SequenceLocation`, and
`Chromosome` + `ChromosomeInterval` => `ChromosomeLocation`
Note that `Sequence` is a linear arrangement of residues and `SequenceInterval` is a region defined by a residue-based coordinate system, resulting in a `SequenceLocation` where both components have the same frame of reference (residues).
Similarly, `Chromosome` (an object TBD in VRS) and `ChromosomeInterval` both have logical "subunits" of cytobands.
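The pairing pattern described above can be sketched as a small class hierarchy. Everything here is an assumption for illustration: `MolecularEntity` and `Chromosome` are not defined by VRS, and the field names and the `entity_type`/`interval_type` class attributes are hypothetical.

```python
from dataclasses import dataclass


class MolecularEntity:
    """Abstract parent of things that can be located on (hypothetical)."""


class Interval:
    """Abstract parent of regions expressed in some counting system."""


@dataclass(frozen=True)
class Sequence(MolecularEntity):
    sequence_id: str


@dataclass(frozen=True)
class SequenceInterval(Interval):
    start: int  # residue-based coordinates
    end: int


@dataclass(frozen=True)
class Chromosome(MolecularEntity):
    species: str
    name: str


@dataclass(frozen=True)
class ChromosomeInterval(Interval):
    start: str  # cytoband names, e.g. "q22.2"
    end: str


@dataclass(frozen=True)
class Location:
    """MolecularEntity + Interval => Location."""
    entity: MolecularEntity
    interval: Interval
    # Unannotated class attributes are not dataclass fields; subclasses narrow them.
    entity_type = MolecularEntity
    interval_type = Interval

    def __post_init__(self):
        # Enforce that both halves share a frame of reference.
        if not isinstance(self.entity, self.entity_type):
            raise TypeError(f"expected {self.entity_type.__name__} entity")
        if not isinstance(self.interval, self.interval_type):
            raise TypeError(f"expected {self.interval_type.__name__} interval")


class SequenceLocation(Location):
    entity_type = Sequence
    interval_type = SequenceInterval


class ChromosomeLocation(Location):
    entity_type = Chromosome
    interval_type = ChromosomeInterval


loc = SequenceLocation(Sequence("refseq:NM_001301302.1"), SequenceInterval(233, 326))
```

Pairing a `Sequence` with a `ChromosomeInterval` raises a `TypeError` in this sketch, which is the consistency requirement stated above expressed as a runtime check.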
Sidebar: A `Chromosome` may be associated with a `Sequence`, and given a mapping between cytobands and coordinates a translation can therefore occur between `SequenceInterval` and `ChromosomeInterval` (likewise the equivalent *Location objects).
`FeatureInterval`, `TranscriptLocation`, and `AnnotatedSequenceLocation`
The clarification of `Interval` and `Location` (above) is important when considering these new terms so that the base classes can be extended consistently.
Since `Feature` has not yet been defined, I propose the following working definition:
Feature: an element that has structural or functional significance on a Sequence (e.g., exon, intron, coding region, gene)
Given this definition, and the one above for `Interval`, then to create the concept of `FeatureInterval` we must determine the "counting system" for the subunits of a `Feature`. Sequences are composed of residues that can be indexed using a simple coordinate system. Chromosomes are composed of cytobands that are referenced using conceptual markers. What is the countable or indexable subunit of a `Feature`?
My struggle with `TranscriptLocation` is a little different. Assuming it inherits from the `Location` base class, that object would be composed of classes that specialize `MolecularEntity` and `Interval`. I might (still thinking on this) see how the `Transcript` class (as defined above) could specialize `MolecularEntity` (but NOT `Sequence`), but then we need to find a corresponding `Interval`. It might be tempting to base the "counting system" for `TranscriptInterval` on Exons, but then the Exon would be defined as the smallest subunit and could not be split further. Therefore, we could refer to only whole Exons and the concept of "last half of Exon 4" would be nonsensical. This approach would be similar to, but not quite the same as, the subunit of `Chromosome` being the `Cytoband` (Exon and Cytoband are both examples of Features, but Cytoband has nomenclature that permits it to be subdivided further, whereas Exon does not).
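To make the limitation concrete, here is a minimal sketch of an exon-counted interval. The `ExonInterval` class, the `to_residue_span` helper, and the exon coordinates are all hypothetical, and interbase (0-based, half-open) residue coordinates are assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExonInterval:
    """Hypothetical interval whose smallest subunit is a whole exon (1-based, inclusive)."""
    first_exon: int
    last_exon: int


def to_residue_span(interval: ExonInterval, exons: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Translate an exon-counted interval into interbase residue coordinates."""
    start = exons[interval.first_exon - 1][0]
    end = exons[interval.last_exon - 1][1]
    return (start, end)


# Made-up exon layout, for illustration only.
exons = [(0, 120), (120, 300), (300, 450), (450, 650)]

# Whole exons are expressible in the exon-based counting system.
whole_exon_4 = to_residue_span(ExonInterval(4, 4), exons)

# "Last half of exon 4" has no ExonInterval representation; it can only be
# written by dropping down to residue coordinates on the underlying sequence.
exon_4_start, exon_4_end = exons[3]
last_half_exon_4 = (exon_4_start + (exon_4_end - exon_4_start) // 2, exon_4_end)

print(whole_exon_4, last_half_exon_4)
```

The translation only runs one way: every `ExonInterval` maps to a residue span, but most residue spans have no exon-counted equivalent.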
I am uncertain of the difference between AnnotatedSequence (Model 2) and a `Sequence` with associated annotations (as might be constructed by the GKS Sequence Annotation subgroup). Furthermore, AnnotatedSequence would be the expected `MolecularEntity` of `AnnotatedSequenceLocation`, but the usage doesn't seem consistent and I'm not sure what the corresponding AnnotatedSequenceInterval would be (what is the logical subunit of an AnnotatedSequence?).
Summary (TL;DR)
After 2 breaks and several national holidays, I finally worked my way through Bob's comment. :-)
Let's please drop discussion of imaginary sequences. It's incidental to the model. I will remove it from future discussions.
I care (very strongly) about one design goal: There should be only one notion of a collection of exons on a sequence. Just as Allele can be defined on any sequence, and just as one Allele might be used to derive another, a collection of exons on a specific sequence should have only one representation irrespective of sequence, and might be related to a similarly-structured set of exons on another sequence.
Thanks for sharing your thoughts @rrfreimuth. I'll put my responses here in the hope it will facilitate additional feedback.
I appreciate the call for semantic consistency, particularly as it relates to `Locations`. I would 👍 `MolecularEntity`, but the VR leads (myself, @larrybabb, and @reece for any newcomers to this org) previously debated and decided (with reservations) against the adoption of `Chromosome` as an entity, by virtue of not having stable versioned identifiers to point to for the concept. I think it will be useful to keep `MolecularEntity` in mind in case we come up with another entity that is parallel to `Sequence` (perhaps `Transcript` is this, as you suggest it might be... more on that below). Fortunately, the concept of `Chromosome` is transposable from our existing data structures, so `MolecularEntity` remains a useful way of thinking about `Locations`, even if it cannot formally encompass the chromosome representation in VRS.
I like the idea of `Intervals` being bound to the smallest subunit comprising a coordinate system. It is very VRS-like in my opinion, as this reduces duplication of concepts by disallowing the same coordinate range in a system to be expressed in multiple ways. I have previously considered the idea of a `Full_Interval` class as a shortcut, and ultimately decided against it for similar reasons.
As to `Transcript` being another `MolecularEntity` type, I agree that this starts to push the boundary a little. A `Transcript` (as we've defined it) is effectively a `Sequence` with exon and CDS annotations on top of it. I have been putting aside my doubts about representing these as Value Objects, though I am concerned that we are painting ourselves into a corner. For example, why is CDS important to capture as an intrinsic property of a transcript, but not structural motifs (which are described by some communities as context-independent, characteristic modules of RNA)? I think both models are taking a very domain-specific (i.e. human-mRNA-centric) approach to defining the biological concept of a transcript.
As to @reece's response to your comment, your point about imaginary intronic coordinates was similar to comments I made in a leads call earlier this week. To explain his comment about it being incidental a little further, the notion of offset-based `Interval` classes is not a firm requirement of his model, and we have agreed that both models can start with `SequenceInterval` classes and build from there as needed to minimize the complexity of the initial proposal.
There is an unresolved sticking point between the two models, however: can the Transcript class be either a contiguous set of exons/cds on an RNA sequence or the discontiguous set of template exons/cds on a genomic sequence? Or is the Transcript class only the exons/cds on an RNA sequence, and the other is a distinct concept? The discussion on this point has gone quite deep, and while I won't rehash it here, suffice it to say that we're still working it through.
However, I am beginning to think that our struggles to converge on a common model are deeper than that. I think we have been glossing over the issue that `Transcripts`, `Genes`, and other conceptually complex entities are too multifaceted to pin down as value objects in VRS. You're correct about `AnnotatedSequence` being a `Sequence` with associated annotations on top of it; that's definitely the concept. I had hoped that the `AnnotatedSequence` object could live in the VRS domain and provide an interface to SA objects, but I think that as a middle-ground solution it is universally disliked (or at least, I haven't heard any support for it). ¯\_(ツ)_/¯
However, I am beginning to see a path where we model these complex entities as `Features` in the Sequence Annotation domain, build `Variations` using `SequenceLocations` or `ChromosomeLocations`, and decorate these VRS objects with those `Features` to convey the necessary context for HGVS reconstruction or other use cases identified down the road. This approach would converge with conversations happening in VA regarding molecular consequence contexts.
The good news is that the VR leads have collectively identified that resolving this issue is non-blocking, and that we can advance in parallel on our efforts to model `Abundance`. This is probably a good thing, as I think many related components are falling into place at once across VA, SA, and VR. In the meantime, the leads will continue to hash this out until a consensus decision is reached.
If there is a transcript representation that knows about CDS, it would be great to be able to indicate ribosomal frameshifting sites (translational slippage). Here is an example of how NCBI currently represents this (https://www.ncbi.nlm.nih.gov/nuccore/NM_001301302.1):
```
     CDS             join(234..326,328..801)
                     /gene="OAZ2"
                     /gene_synonym="AZ2"
                     /ribosomal_slippage
                     /note="protein translation is dependent on
                     polyamine-induced +1 ribosomal frameshift; isoform 2 is
                     encoded by transcript variant 2; ODC-Az 2"
                     /codon_start=1
                     /product="ornithine decarboxylase antizyme 2 isoform 2"
                     /protein_id="NP_001288231.1"
                     /db_xref="GeneID:4947"
                     /db_xref="HGNC:HGNC:8096"
                     /db_xref="MIM:604152"
                     /translation="MINTQDSILPLSNCPQLQCCRHIVPGPLWCSDAPHPLSKIPGGR
                     GGGRDPSLSALIYKDEKLTVTQDLPVNDGKPHIVHFQYEVTEVKVSSWDAVLSSQSLF
                     VEIPDGLLADGSKEGLLALLEFAEEKMKVNYVFICFRKGREDRAPLLKTFSFLGFEIV
                     RPGHPCVPSRPDVMFMVYPLDQNLSDED"
```
Glad you raised this. How do you imagine that this would be used in practice? Would you report variation in the shifted or unshifted coordinates? Is the shifting a property of the transcript, or a property of the coordinates on the transcript?
If shifting is a property of the transcript, that would suggest an information model where a transcript consists of: a sequence, a set of exons on that sequence, an optional CDS <start,end>, and a new shift (signed int?). Is that what you had in mind?
Ribosomal slippage happens in the middle of the mRNA, during translation. The way I would like to describe variants is relative to the protein-effect. To get this right, it will be important to be precise where exactly the variant is located. The variant could be either in the first part of a protein (in the unshifted region), or in the second part (the shifted region), or perhaps even overlap the site where the shift happens. (Note: I believe there are also examples with multiple shifts in one mRNA, not sure if they have been observed in human though)
I agree, if shifting is represented as a property of a transcript, the transcript would have a sequence, exons, and a list of CDS <start, end>. I don't think it needs a signed int, since the CDS <start, end> positions would already include this. The example above refers to this as `join(234..326,328..801)`. As you can see, position 327 is part of an exon, but it is skipped and does not end up contributing to a codon.
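A short sketch of how a list of CDS segments already encodes the shift, using the `join(234..326,328..801)` location from the NCBI record. The `parse_join` and `cds_offset` helpers are hypothetical and only handle plain `start..end` segments, not the full GenBank location syntax.

```python
import re
from typing import List, Optional, Tuple


def parse_join(location: str) -> List[Tuple[int, int]]:
    """Parse a simple location such as 'join(234..326,328..801)'.

    Returns 1-based, inclusive (start, end) segments in the order given.
    """
    return [(int(s), int(e)) for s, e in re.findall(r"(\d+)\.\.(\d+)", location)]


def cds_offset(segments: List[Tuple[int, int]], position: int) -> Optional[int]:
    """0-based offset into the joined CDS for a 1-based transcript position.

    Returns None when the position is not part of the CDS (e.g. a skipped base).
    """
    offset = 0
    for start, end in segments:
        if start <= position <= end:
            return offset + (position - start)
        offset += end - start + 1
    return None


segments = parse_join("join(234..326,328..801)")
cds_length = sum(end - start + 1 for start, end in segments)

print(segments)                    # [(234, 326), (328, 801)]
print(cds_length, cds_length % 3)  # 567 0
print(cds_offset(segments, 326))   # 92   (last base of codon 31)
print(cds_offset(segments, 327))   # None (skipped by the +1 frameshift)
print(cds_offset(segments, 328))   # 93   (first base of codon 32)
```

The joined length is 567 nt, i.e. 189 codons, which matches the 188-residue `/translation` if the final codon is the stop. A variant's codon and frame can be derived from the offset alone, so no separate signed shift value is needed in this representation.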
This should be revisited with the SA team with the upcoming gene and transcript models.
Transcript-based coordinates depend on at least the exon structure. Coding positions also require the CDS start and end.
Proposal: Implement TranscriptLocation, à la HGVS, by creating a class that stores coordinates on a specific transcript. The transcript reference should probably be a CURIE.
Ideally, the transcript will also depend on a computed digest for the transcript (as UTA does), to de-dupe based on reference sequence, exon structure, and CDS start/end.
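A minimal sketch of what this proposal might look like. The `TranscriptLocation` fields, the `transcript_digest` helper, the `tx:` CURIE prefix, and the JSON serialization are all assumptions for illustration; this is not the VRS serialization or UTA's actual digest. The digest uses a truncated SHA-512 (first 24 bytes, base64url-encoded), in the style of the GA4GH `sha512t24u` function.

```python
import base64
import hashlib
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


def sha512t24u(blob: bytes) -> str:
    """Truncated SHA-512 digest: first 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def transcript_digest(
    sequence_id: str,
    exons: List[Tuple[int, int]],
    cds: Optional[Tuple[int, int]],
) -> str:
    """Digest over reference sequence, exon structure, and CDS start/end.

    The canonical form here (sorted-key compact JSON) is a hypothetical choice.
    """
    payload = {
        "sequence": sequence_id,
        "exons": [list(exon) for exon in exons],
        "cds": list(cds) if cds is not None else None,
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return sha512t24u(blob)


@dataclass(frozen=True)
class TranscriptLocation:
    """Coordinates on a specific transcript, referenced by CURIE."""
    transcript: str  # CURIE, e.g. "refseq:NM_001301302.1" or a digest-based one
    start: int
    end: int


# Made-up exon and CDS coordinates, for illustration only.
exons = [(0, 120), (120, 300), (300, 450)]
digest = transcript_digest("refseq:NM_001301302.1", exons, (50, 400))
loc = TranscriptLocation(transcript=f"tx:{digest}", start=60, end=61)
print(loc)
```

Two transcript records with the same reference sequence, exon structure, and CDS bounds produce the same digest under this scheme, so they collapse to one identifier regardless of which accession they were loaded from.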