Closed reece closed 7 months ago
Suggested changes to @mihailefter's diagram for discussion:
@ifokkema @jtdendunnen @ahwagner @jfjlaros @marinadistefano : Any comments on this thread of work?
A few remarks on the suggested changes:
- c. and n. coordinate systems are genomic; they should be in the DNA box.
- r. [...]). Therefore, we cannot use both c. and n. in one picture, as it is unclear which coordinate system the r. coordinates relate to. I have actually considered proposing a coordinate system for non-coding RNA (s. for structural) to fill this gap.

Thanks Reece. I like that proposed diagram a lot and agree with Jeroen's comments.
@jfjlaros I have a different view, and I think we can unify them.
The coordinate system -- i.e., how to interpret coordinates -- is integers (for g. and p.) or base+offset positions (c., n., r.).
The reference sequence is dna, rna, or aa.
The distinction between c, n, r is NOT coordinate system, but rather reference sequence. Capturing this similarity seems useful to me.
Our variant types (c, g, m, n, p, r) convey both of the above issues at the same time.
So, my proposal is to have
This would allow us to better express the relationship between the coordinate system and reference sequence type.
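The integer versus base+offset distinction described in this comment can be illustrated with a minimal sketch. The names `COORDINATE_STYLE` and `parse_position` are hypothetical, the prefix assignments simply restate this comment, and the parser assumes simplified positions only (an unsigned base with an optional signed offset; no `*` or negative bases).

```python
import re

# Prefix -> how its coordinates are interpreted, as listed in the comment above.
# (Hypothetical table for illustration; m. is left out because the comment
# does not assign it a style.)
COORDINATE_STYLE = {
    "g": "integer",
    "p": "integer",
    "c": "base+offset",
    "n": "base+offset",
    "r": "base+offset",
}

# Simplified position syntax: digits, optionally followed by +digits or -digits.
_POSITION = re.compile(r"(\d+)([+-]\d+)?")


def parse_position(text):
    """Split a position such as '186+1' into (base, offset).

    A plain integer position is returned with an offset of 0.
    """
    match = _POSITION.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported position: {text!r}")
    base = int(match.group(1))
    offset = int(match.group(2)) if match.group(2) else 0
    return base, offset
```

Under this reading, an integer coordinate is just the special case where the offset is always 0, which is one way to see the similarity the comment wants to capture.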
@reece I agree that the distinction should be made based on the reference sequence. This is in line with my suggestion to group the coordinate systems as {g., m., o., c., n.} for DNA, {r.} for RNA and {p.} for proteins.
As mentioned in #147, r. coordinates do not have offsets because they refer to the mature (coding or non-coding) RNA sequence, not to the underlying DNA reference sequence. Unfortunately, the recommendations include examples like LRG_199t1:r.186_187ins186+1_186+4 [1], which is why a documentation fix was requested by @ifokkema. This would be an additional reason not to group the r. coordinate system together with c. and n.
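The grouping suggested above can be written down as a small lookup, sketched here for illustration only. `PREFIX_TO_REFERENCE_TYPE` and `reference_type` are hypothetical names, and the function assumes a description of the form `reference:prefix.rest`; it does not validate anything after the prefix.

```python
# Grouping of coordinate systems by reference sequence type, as suggested in
# the comment above: {g., m., o., c., n.} for DNA, {r.} for RNA, {p.} for proteins.
PREFIX_TO_REFERENCE_TYPE = {
    "g": "DNA",
    "m": "DNA",
    "o": "DNA",
    "c": "DNA",
    "n": "DNA",
    "r": "RNA",
    "p": "protein",
}


def reference_type(description):
    """Return the reference sequence type implied by a description's prefix."""
    # Split off the reference part, then take the text before the first '.'.
    _, _, variant = description.partition(":")
    prefix, dot, _ = variant.partition(".")
    if not dot or prefix not in PREFIX_TO_REFERENCE_TYPE:
        raise ValueError(f"no known prefix in {description!r}")
    return PREFIX_TO_REFERENCE_TYPE[prefix]
```

For the example quoted from the recommendations, `reference_type("LRG_199t1:r.186_187ins186+1_186+4")` gives `"RNA"`, which is the grouping under which the `+1` and `+4` offsets in that example look out of place.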
My counterproposal would be either one picture containing:
- g., c. and n. coordinates, including examples of how intronic positions can be addressed.
- r. and the now non-existing s. coordinates.
- p. coordinates.

Or two pictures, one using a coding transcript, containing:
- g. and c. coordinates, including examples of how intronic positions can be addressed.
- r. coordinates.
- p. coordinates.

and the other one using a non-coding transcript, containing:
- g. and n. coordinates, including examples of how intronic positions can be addressed.
- r. coordinates.

If we choose to have two pictures, no additions to the nomenclature are needed. I also think that using a transcript that is both coding and non-coding might confuse the reader.
My observations/remarks:
- [...] NC(NM) mapping for the c. positions for the introns and the positions beyond the transcription initiation site and polyadenylation site.
- [...] r. only; it uses an NM reference sequence, and the figure shows that the introns are gone and everything before the transcription initiation site and past the polyadenylation site is absent.
- [...] n. in this figure would add quite some complexity. If we want to explain the relationship between all types of reference sequences and prefixes/coordinate systems as well, I agree it's better to add another figure for that.

> it uses an NM reference sequence

Mutalyzer slices the exons from the reference sequence and transcribes the concatenated result to RNA. The reference protein sequence is generated in a similar way.
The results do not have to correspond to an NM or NP reference sequence, they may not even exist.
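As a rough illustration of the slicing step described above (not Mutalyzer's actual code), the sketch below concatenates exon intervals from a toy DNA sequence and transcribes the result. The function name, the 0-based half-open exon intervals, the forward-strand-only handling, and the toy sequence are all assumptions made for this example; translation to protein is not shown.

```python
def spliced_rna(reference, exons):
    """Concatenate exon slices of a DNA reference and transcribe them to RNA.

    `exons` is a list of (start, end) pairs, 0-based and half-open, assumed to
    be sorted and on the forward strand.
    """
    spliced_dna = "".join(reference[start:end] for start, end in exons)
    # Transcription on the forward strand: replace thymine with uracil.
    return spliced_dna.replace("T", "U")


# Toy reference: exon 1, an 11-base intron, exon 2.
toy_reference = "CCATGGCA" + "GTAAGTTTCAG" + "GATTAACC"
toy_exons = [(0, 8), (19, 27)]
toy_rna = spliced_rna(toy_reference, toy_exons)
```

Here `toy_rna` is `"CCAUGGCAGAUUAACC"`: it is derived entirely from the DNA reference and the exon boundaries, so nothing requires a separately curated transcript record to exist for it.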
> Mutalyzer slices the exons from the reference sequence and transcribes the concatenated result to RNA. The reference protein sequence is generated in a similar way.
> The results do not have to correspond to an NM or NP reference sequence, they may not even exist.

For NC(NM) or NC(genesymbol) in the case of mitochondrial genes, this is indeed the case. However, the graph doesn't show this, and it's also an implementation detail that differs between tools. Again, I recommend not complicating things too much. The graph doesn't show introns in the RNA box, and whether the source is an NM or an NM mapping of an NC sequence with introns removed doesn't matter.
I think the proposed figure would be fine merged as is. I think it does a good job of illustrating the complexity of the 'c.' (cDNA) sequence representation by coloring it the same as RNA transcript representations, but placing it in the box with DNA sequences.
If I were to make changes, it would be to further highlight this distinction by changing the color of the DNA, RNA, and Protein sequence alphabet containers to gray, and adding a legend that describes colors for the represented molecule type (genome / transcript / protein) separately.
This is all good discussion, but I fear that we're opening up a bunch of new questions and work that will keep this very beneficial contribution from getting merged.
Because the diagram proposed by @mihailefter is superior to the current diagram, I propose that we merge it as is. Then, people are free to submit follow-up PRs to make any of the changes in this thread as their time permits.
I'm going to move the PR out of draft and request reviews.
> whether the source is an NM or an NM mapping of NC sequence with introns removed, doesn't matter.

Only if there is a 1-to-1 correspondence between the two.
The point of this figure is to show the relationships between the different numbering systems. If we drop the underlying assumption that the reference sequences at each level can be derived from the DNA sequence, we should not use this figure.
@jfjlaros
> Only if there is a 1-to-1 correspondence between the two.

We already discussed this extensively in https://github.com/HGVSnomenclature/hgvs-nomenclature/discussions/4. I was obviously talking about the context of this figure; let's not overcomplicate this.
> The point of this figure is to show the relationships between the different numbering systems. If we drop the underlying assumption that the reference sequences at each level can be derived from the DNA sequence, we should not use this figure.

Again, let's not overcomplicate things.
> let's not overcomplicate things.

Agreed.
In my opinion, this is done by carefully avoiding controversial statements to keep discussions pure. We seem to have the habit of promoting statements from discussions, answers to questions and examples given on the website to rules. This is why I am so (overly) picky about these types of things. Before we know it, we have set a precedent in the slipstream of another discussion.
@mihailefter offered a much-improved reference sequence diagram in https://github.com/HGVSnomenclature/hgvs-nomenclature/discussions/49#discussioncomment-6342094 to replace docs/assets/RefSeq.jpg. Reproduced here:
Compare with current: