Update reference sequence diagram

reece commented 8 months ago

@mihailefter offered a much-improved reference sequence diagram in https://github.com/HGVSnomenclature/hgvs-nomenclature/discussions/49#discussioncomment-6342094 to replace docs/assets/RefSeq.jpg. Reproduced here:

Compare with current:

RefSeq

reece commented 8 months ago

Suggested changes to @mihailefter's diagram for discussion:

Use a chromosomal sequence so that g. coordinates are more typical
Also keep a local g. frame (e.g., NG), as currently depicted
Consider a note about - strand genes
Move c. to the RNA box and add n. so that c, n, r are in the same transcript box
Depict gene structure as cartoon with thin (UTR) and thick (CDS) boxes like ──████───█████─────██████──███─
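To make the cartoon suggestion concrete, here is a minimal sketch of how such a track could be generated from exon and CDS intervals. The function name `render_gene`, the 0-based half-open intervals, and the choice of `▄` for UTR are assumptions made for illustration only; they are not part of the proposed diagram.

```python
def render_gene(length, exons, cds):
    """Render a one-line gene cartoon.

    length: total number of positions to draw
    exons:  list of (start, end) intervals, 0-based, end exclusive (assumed convention)
    cds:    (start, end) interval of the coding region, same convention
    """
    INTRON, UTR, CDS = "─", "▄", "█"  # glyph choices are illustrative
    track = [INTRON] * length
    for start, end in exons:
        for i in range(start, end):
            # Exonic positions inside the CDS are drawn thick, the rest thin.
            track[i] = CDS if cds[0] <= i < cds[1] else UTR
    return "".join(track)


# Hypothetical two-exon gene with UTRs at both ends.
cartoon = render_gene(20, exons=[(2, 8), (12, 18)], cds=(4, 15))
print(cartoon)
```

Introns and flanking sequence share the thin line here; a real figure would probably distinguish them.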

@ifokkema @jtdendunnen @ahwagner @jfjlaros @marinadistefano : Any comments on this thread of work?

jfjlaros commented 8 months ago

A few remarks on the suggested changes:

The c. and n. coordinate systems are genomic, they should be in the DNA box.
We cannot express both coding and non-coding coordinates on RNA because there is only one coordinate system for RNA (r.). Therefore, we cannot use both c. and n. in one picture, as it is unclear which coordinate system the r. coordinates relate to. I have actually considered proposing a coordinate system for non-coding RNA (s. for structural) to fill this gap.

marinadistefano commented 8 months ago

Thanks Reece. I like that proposed diagram a lot and agree with Jeroen's comments.

reece commented 8 months ago

@jfjlaros I have a different view, and I think we can unify them.

The coordinate system -- i.e., how to interpret coordinates -- is integers (for g. and p.) or base+offset positions (c., n., r.).

The reference sequence is dna, rna, or aa.

The distinction between c, n, r is NOT coordinate system, but rather reference sequence. Capturing this similarity seems useful to me.

Our variant types (c, g, m, n, p, r) convey both of the above issues at the same time.

So, my proposal is to have

a genomic tier with g. coordinates (perhaps on a NC and aligned NG)
a transcript tier with c, n, r coordinates written as NC(NM)
a protein tier with p. coordinates

This would allow us to better express the relationship between the coordinate system and reference sequence type.
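The "integer versus base+offset" distinction drawn above can be sketched as a small parser. This is only an illustration under stated assumptions: the names `Position` and `parse_base_offset` are hypothetical, and `*`-prefixed (3' UTR) positions, uncertain positions, and ranges are deliberately left out.

```python
import re
from typing import NamedTuple


class Position(NamedTuple):
    base: int        # position on the reference sequence
    offset: int = 0  # signed distance from the base (e.g. into an intron)


# Optional leading '-' on the base, optional signed offset after it.
_POSITION = re.compile(r"^(-?\d+)([+-]\d+)?$")


def parse_base_offset(text):
    """Parse '186+1', '187-2', '-12', or '42' into a Position."""
    match = _POSITION.match(text)
    if match is None:
        raise ValueError(f"not a base+offset position: {text!r}")
    base = int(match.group(1))
    offset = int(match.group(2)) if match.group(2) else 0
    return Position(base, offset)


print(parse_base_offset("186+1"))  # an offset position
print(parse_base_offset("42"))     # a plain integer is the offset-0 case
```

In this view a plain integer coordinate is just the special case with offset 0, which is one way to read the similarity being argued for.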

jfjlaros commented 8 months ago

@reece I agree that the distinction should be made based on the reference sequence. This is in line with my suggestion to group the coordinate systems as {g., m., o., c., n.} for DNA, {r.} for RNA and {p.} for proteins.

As mentioned in #147, r. coordinates do not have offsets because they refer to the mature (coding or non-coding) RNA sequence, not to the underlying DNA reference sequence. Unfortunately, the recommendations include examples like LRG_199t1:r.186_187ins186+1_186+4 [1], which is why a documentation fix was requested by @ifokkema. This would be an additional reason not to group the r. coordinate system together with c. and n..
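As a rough illustration of the grouping and of the r. offset concern, the sketch below maps each prefix to its molecule type and flags r. descriptions that contain base+offset positions. The names `MOLECULE_BY_PREFIX` and `r_uses_offsets` are hypothetical, and the regular expression is a simple heuristic rather than a full HGVS parser.

```python
import re

# Grouping as suggested in this comment: prefix -> reference sequence type.
MOLECULE_BY_PREFIX = {
    "g": "DNA", "m": "DNA", "o": "DNA", "c": "DNA", "n": "DNA",
    "r": "RNA",
    "p": "protein",
}

# Heuristic: a digit, then '+' or '-', then a digit (e.g. "186+1").
_OFFSET = re.compile(r"\d[+-]\d")


def r_uses_offsets(description):
    """Return True if an r. description contains base+offset positions."""
    _, _, body = description.partition(":")  # drop the reference part
    prefix = body.split(".", 1)[0]
    return prefix == "r" and bool(_OFFSET.search(body))


# The example quoted from the recommendations is flagged.
print(r_uses_offsets("LRG_199t1:r.186_187ins186+1_186+4"))
# A substitution without offsets (input made up for illustration) is not.
print(r_uses_offsets("LRG_199t1:r.76a>c"))
```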

My counter proposal would be either one picture containing:

A genomic tier with g., c. and n. coordinates, including examples of how intronic positions can be addressed.
A transcript tier with r. and the now non-existing s. coordinates.
A protein tier with p. coordinates.

Or two pictures, one using a coding transcript, containing:

A genomic tier with g. and c. coordinates, including examples of how intronic positions can be addressed.
A transcript tier with r. coordinates.
A protein tier with p. coordinates.

and the other one using a non-coding transcript, containing:

A genomic tier with g. and n. coordinates, including examples of how intronic positions can be addressed.
A transcript tier with r. coordinates.

If we choose to have two pictures, no additions to the nomenclature are needed. I also think that using a transcript that is both coding as well as non-coding might confuse the reader.

ifokkema commented 8 months ago

My observations/remarks:

I do agree that having more "typical" genomic positions would help. We could pick a gene and use actual positions.
We are required to use an NC(NM) mapping for the c. positions for the introns and the positions beyond the transcription initiation site and polyadenylation site.
This is also an additional reason to keep the RNA box for r. only; it uses an NM reference sequence, and the figure shows that the introns are gone and everything before the transcription initiation site and past the polyadenylation site is absent.
We might be trying to put too much in one figure. I believe one reason Mihai's version is so much better is that it's very clean. If we cram too much information into one figure, it may not be that clean anymore. E.g., adding an NG, adding info on genes on the negative strand, and using n. in this figure would add quite some complexity. If we want to explain the relationship between all types of reference sequences and prefixes/coordinate systems as well, I agree it's better to add another figure for that.

jfjlaros commented 8 months ago

it uses an NM reference sequence

Mutalyzer slices the exons from the reference sequence and transcribes the concatenated result to RNA. The reference protein sequence is generated in a similar way.

The results do not have to correspond to an NM or NP reference sequence, they may not even exist.
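The derivation described above (slice exons, concatenate, transcribe, then translate) can be sketched as follows. This is not Mutalyzer's code: the function names, the toy sequence, the forward-strand assumption, and the 0-based half-open exon intervals are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def splice(dna, exons):
    """Concatenate exon slices (0-based, end exclusive; forward strand only)."""
    return "".join(dna[start:end] for start, end in exons)


def transcribe(dna):
    """Write the spliced sequence as RNA (lowercase, u instead of t)."""
    return dna.lower().replace("t", "u")


def translate(rna, cds_start):
    """Translate from cds_start until the first stop codon."""
    seq = rna.upper()
    protein = []
    for i in range(cds_start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy reference: exon 1, a 6-base intron, exon 2 (made-up sequence).
dna = "CCATGGCT" + "GTAAGT" + "TTTTAAGG"
rna = transcribe(splice(dna, [(0, 8), (14, 22)]))
protein = translate(rna, cds_start=2)
print(rna, protein)
```

Nothing in this derivation requires the resulting RNA or protein sequence to match an existing NM or NP record, which is the point made in the comment.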

ifokkema commented 8 months ago

Mutalyzer slices the exons from the reference sequence and transcribes the concatenated result to RNA. The reference protein sequence is generated in a similar way.

The results do not have to correspond to an NM or NP reference sequence, they may not even exist.

For NC(NM) or NC(genesymbol) in the case of mitochondrial genes, this is indeed the case. However, the graph doesn't show this, and it's also an implementation detail that differs between tools. Again, I recommend not to complicate things too much. The graph doesn't show introns in the RNA box, and whether the source is an NM or an NM mapping of NC sequence with introns removed, doesn't matter.

ahwagner commented 8 months ago

I think the proposed figure would be fine merged as is. I think it does a good job of illustrating the complexity of the 'c.' (cDNA) sequence representation by coloring it the same as RNA transcript representations, but placing it in the box with DNA sequences.

If I were to make changes, it would be to further highlight this distinction by changing the color of the DNA, RNA, and Protein sequence alphabet containers to gray, and add a legend that describes colors for the represented molecule type (genome / transcript / protein) separately.

reece commented 8 months ago

This is all good discussion, but I fear that we're opening up a bunch of new questions and work that will keep this very beneficial contribution from getting merged.

Because the diagram proposed by @mihailefter is superior to the current diagram, I propose that we merge it as is. Then, people are free to submit follow-up PRs to make any of the changes in this thread as their time permits.

I'm going to move the PR out of draft and request reviews.

jfjlaros commented 8 months ago

whether the source is an NM or an NM mapping of NC sequence with introns removed, doesn't matter.

Only if there is a 1-to-1 correspondence between the two.

The point of this figure is to show the relationships between the different numbering systems. If we drop the underlying assumption that the reference sequences at each level can be derived from the DNA sequence, we should not use this figure.

ifokkema commented 8 months ago

@jfjlaros

Only if there is a 1-to-1 correspondence between the two.

We already discussed this extensively in https://github.com/HGVSnomenclature/hgvs-nomenclature/discussions/4. I was obviously talking about the context of this figure, let's not overcomplicate this.

The point of this figure is to show the relationships between the different numbering systems. If we drop the underlying assumption that the reference sequences at each level can be derived from the DNA sequence, we should not use this figure.

Again, let's not overcomplicate things.

jfjlaros commented 8 months ago

let's not overcomplicate things.

Agreed.

In my opinion, this is done by carefully avoiding controversial statements to keep discussions pure. We seem to have the habit of promoting statements from discussions, answers to questions and examples given on the website to rules. This is why I am so (overly) picky on these types of things. Before we know it, we have set a precedent in the slipstream of another discussion.

HGVSnomenclature / hgvs-nomenclature

Update reference sequence diagram #155