HGVSnomenclature / hgvs-nomenclature

HGVS Nomenclature website
https://hgvs-nomenclature.org/
MIT License

Update reference sequence diagram #155

Closed: reece closed this issue 7 months ago

reece commented 7 months ago

@mihailefter offered a much-improved reference sequence diagram in https://github.com/HGVSnomenclature/hgvs-nomenclature/discussions/49#discussioncomment-6342094 to replace docs/assets/RefSeq.jpg. It is reproduced here:

[Image: proposed reference sequence diagram]

Compare with current:

[Image: current diagram, docs/assets/RefSeq.jpg]

reece commented 7 months ago

Suggested changes to @mihailefter's diagram for discussion:

@ifokkema @jtdendunnen @ahwagner @jfjlaros @marinadistefano : Any comments on this thread of work?

jfjlaros commented 7 months ago

A few remarks on the suggested changes:

marinadistefano commented 7 months ago

Thanks Reece. I like that proposed diagram a lot and agree with Jeroen's comments.

reece commented 7 months ago

@jfjlaros I have a different view, and I think we can unify them.

The coordinate system -- i.e., how to interpret coordinates -- is integers (for g. and p.) or base+offset positions (c., n., r.).

The reference sequence is dna, rna, or aa.

The distinction between c, n, r is NOT coordinate system, but rather reference sequence. Capturing this similarity seems useful to me.

Our variant types (c, g, m, n, p, r) convey both of the above issues at the same time.

So, my proposal is to have

This would allow us to better express the relationship between the coordinate system and reference sequence type.
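As a rough illustration of the two-axis view described in this comment, the sketch below tabulates each variant type by reference sequence and coordinate system. The names `VariantType`, `VARIANT_TYPES`, and `group_by` are hypothetical, the entry for m. is an assumption (the comment does not assign it a coordinate system), and the base+offset entry for r. reflects this comment only; it is questioned later in the thread.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantType:
    prefix: str        # HGVS variant type prefix, without the dot
    sequence: str      # reference sequence type: "DNA", "RNA", or "protein"
    coordinates: str   # "integer" or "base+offset"


# Hypothetical table following the comment above.
VARIANT_TYPES = {
    vt.prefix: vt
    for vt in (
        VariantType("g", "DNA", "integer"),
        VariantType("m", "DNA", "integer"),       # assumption: not stated in the comment
        VariantType("c", "DNA", "base+offset"),
        VariantType("n", "DNA", "base+offset"),
        VariantType("r", "RNA", "base+offset"),   # disputed later in the thread
        VariantType("p", "protein", "integer"),
    )
}


def group_by(attribute):
    """Group prefixes by one axis: 'sequence' or 'coordinates'."""
    groups = defaultdict(list)
    for vt in VARIANT_TYPES.values():
        groups[getattr(vt, attribute)].append(vt.prefix)
    return {key: sorted(prefixes) for key, prefixes in groups.items()}
```

Grouping by `"coordinates"` puts c, n, and r together, while grouping by `"sequence"` separates r from c and n, which is the relationship the comment wants the diagram to express.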

jfjlaros commented 7 months ago

@reece I agree that the distinction should be made based on the reference sequence. This is in line with my suggestion to group the coordinate systems as {g., m., o., c., n.} for DNA, {r.} for RNA and {p.} for proteins.

As mentioned in #147, r. coordinates do not have offsets because they refer to the mature (coding or non-coding) RNA sequence, not to the underlying DNA reference sequence. Unfortunately, the recommendations include examples like LRG_199t1:r.186_187ins186+1_186+4 [1], which is why a documentation fix was requested by @ifokkema. This would be an additional reason not to group the r. coordinate system together with c. and n.
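To make the base+offset notation concrete, here is a minimal sketch that splits a position such as `186+1` into a base and an offset, and reports whether a range uses offsets at all. The names `Position`, `parse_position`, and `range_uses_offsets` are hypothetical, and the pattern is deliberately simplified: it assumes a plain positive base and does not handle forms such as `-12` or `*5`.

```python
import re
from typing import NamedTuple


class Position(NamedTuple):
    base: int
    offset: int  # 0 when the position has no offset


# Simplified pattern: digits, optionally followed by a signed offset.
_POSITION = re.compile(r"(?P<base>\d+)(?P<offset>[+-]\d+)?")


def parse_position(text):
    """Parse '186' or '186+1' into a Position."""
    match = _POSITION.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported position: {text!r}")
    return Position(int(match["base"]), int(match["offset"] or 0))


def range_uses_offsets(text):
    """Return True if any endpoint of a 'start_end' range has a non-zero offset."""
    return any(parse_position(part).offset != 0 for part in text.split("_"))
```

Under these assumptions, `range_uses_offsets("186_187")` is `False` and `range_uses_offsets("186+1_186+4")` is `True`, so a check of this kind could flag the inserted range in the example above as using offsets.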

My counter proposal would be either one picture containing:

Or two pictures, one using a coding transcript, containing:

and the other one using a non-coding transcript, containing:

If we choose to have two pictures, no additions to the nomenclature are needed. I also think that using a transcript that is both coding and non-coding might confuse the reader.

ifokkema commented 7 months ago

My observations/remarks:

jfjlaros commented 7 months ago

it uses an NM reference sequence

Mutalyzer slices the exons from the reference sequence and transcribes the concatenated result to RNA. The reference protein sequence is generated in a similar way.

The results do not have to correspond to an NM or NP reference sequence, they may not even exist.
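The slicing step described here can be sketched as follows. This is not Mutalyzer's actual code; `splice_and_transcribe` is a hypothetical helper that assumes a forward-strand transcript and exons given as 0-based, half-open `(start, end)` intervals in transcript order.

```python
def splice_and_transcribe(reference, exons):
    """Concatenate exon slices of a DNA reference and transcribe to RNA.

    reference: DNA sequence as a string (assumed forward strand).
    exons: iterable of (start, end) pairs, 0-based and half-open.
    """
    spliced = "".join(reference[start:end] for start, end in exons)
    # Transcription is modelled here as a plain T -> U substitution.
    return spliced.upper().replace("T", "U")


# Toy example with two exons separated by an intron.
rna = splice_and_transcribe("AAATGCCGTTTTAGGG", [(2, 6), (10, 14)])
```

The resulting sequence is derived purely from the reference and the exon coordinates, which is why it need not match any existing NM record.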

ifokkema commented 7 months ago

Mutalyzer slices the exons from the reference sequence and transcribes the concatenated result to RNA. The reference protein sequence is generated in a similar way.

The results do not have to correspond to an NM or NP reference sequence, they may not even exist.

For NC(NM) or NC(genesymbol) in the case of mitochondrial genes, this is indeed the case. However, the graph doesn't show this, and it's also an implementation detail that differs between tools. Again, I recommend not to complicate things too much. The graph doesn't show introns in the RNA box, and whether the source is an NM or an NM mapping of NC sequence with introns removed, doesn't matter.

ahwagner commented 7 months ago

I think the proposed figure would be fine merged as is. I think it does a good job of illustrating the complexity of the 'c.' (cDNA) sequence representation by coloring it the same as RNA transcript representations, but placing it in the box with DNA sequences.

If I were to make changes, it would be to further highlight this distinction by changing the color of the DNA, RNA, and Protein sequence alphabet containers to gray, and add a legend that describes colors for the represented molecule type (genome / transcript / protein) separately.

reece commented 7 months ago

This is all good discussion, but I fear that we're opening up a bunch of new questions and work that will keep this very beneficial contribution from getting merged.

Because the diagram proposed by @mihailefter is superior to the current diagram, I propose that we merge it as is now. Then, people are free to submit follow-up PRs to make any of the changes in this thread as their time permits.

I'm going to move the PR out of draft and request reviews.

jfjlaros commented 7 months ago

whether the source is an NM or an NM mapping of NC sequence with introns removed, doesn't matter.

Only if there is a 1-to-1 correspondence between the two.

The point of this figure is to show the relationships between the different numbering systems. If we drop the underlying assumption that the reference sequences at each level can be derived from the DNA sequence, we should not use this figure.

ifokkema commented 7 months ago

@jfjlaros

Only if there is a 1-to-1 correspondence between the two.

We already discussed this extensively in https://github.com/HGVSnomenclature/hgvs-nomenclature/discussions/4. I was obviously talking about the context of this figure; let's not overcomplicate this.

The point of this figure is to show the relationships between the different numbering systems. If we drop the underlying assumption that the reference sequences at each level can be derived from the DNA sequence, we should not use this figure.

Again, let's not overcomplicate things.

jfjlaros commented 7 months ago

let's not overcomplicate things.

Agreed.

In my opinion, this is done by carefully avoiding controversial statements to keep discussions pure. We seem to have the habit of promoting statements from discussions, answers to questions, and examples given on the website to rules. This is why I am so (overly) picky about these types of things. Before we know it, we have set a precedent in the slipstream of another discussion.