Descriptions using the r. coordinate type should not have intronic positions

ifokkema commented 9 months ago

Based on this discussion: https://github.com/HGVSnomenclature/hgvs-nomenclature/discussions/49.

As the r. coordinate type refers to mature (coding) RNA molecules, descriptions using the r. coordinate type should not have intronic positions. Therefore, variant descriptions like NC_000023.10(NM_004006.2):r.186_187ins186+1_186+4 (taken from the numbering page) are actually invalid.

During an HVNC meeting, we agreed that:

- The mRNA molecule does not have introns; therefore, r. variants do not have intronic positions.
- Inserting cDNA positions into an RNA reference sequence was acceptable, but it will get more complicated if we allow other types of "cross-molecule" insertions as well.
- This would mean that intron retention should be described as NM_004006.2:r.186_187ins[NC_000023.10(NM_004006.2):c.186+1_186+4] or perhaps NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4].
- Insertions from other parts of the genome could be described as, e.g., NM_004006.2:r.94_264delins[NC_000008.10:g.16369356_16369419inv].
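
As an illustration of the agreement above, here is a minimal sketch of a check that flags intronic offsets in the r. part of a description while ignoring anything inside square brackets. The function name and the regular expressions are assumptions made for this example; they are not part of any HGVS tool and cover only simple descriptions like the ones in this issue.

```python
import re

# Hypothetical pattern: a digit, a + or - sign, then a digit (e.g. 186+1, 650-50).
INTRONIC_OFFSET = re.compile(r"\d[+-]\d")


def has_intronic_r_position(description: str) -> bool:
    """Return True if the r. part contains an intronic offset outside brackets."""
    # Keep only the text after the first ':r.'; without it there is nothing to check.
    _, sep, r_part = description.partition(":r.")
    if not sep:
        return False
    # Drop bracketed insertions such as ins[c.186+1_186+4], which may use
    # another coordinate type.
    outside_brackets = re.sub(r"\[[^\]]*\]", "", r_part)
    return INTRONIC_OFFSET.search(outside_brackets) is not None
```

Under these assumptions, NC_000023.10(NM_004006.2):r.186_187ins186+1_186+4 is flagged, while both bracketed forms proposed above pass.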

To do:

[ ] Identify all incorrect examples in the docs; by my count:
- NC_000023.11(NM_004006.2):r.649_650ins650-50_650-1 (RNA/splicing)
- NC_000023.11(NM_004006.2):r.831_832ins831+1_831+67 (RNA/splicing)
- NC_000023.11(NM_004006.2):r.649_650ins650-1400_650-1268 (RNA/splicing)
- LRG_199t1:r.186_187ins186+1_186+4, NC_000023.10(NM_004006.2):r.186_187ins186+1_186+4, NG_012232.1(NM_004006.2):r.186_187ins186+1_186+4 (numbering)
- NC_000023.10(NM_004006.2):r.357_358ins357+1_357+12, NG_012232.1(NM_004006.2):r.357_358ins357+1_357+12 (refseq)
[ ] Fix them, by replacing ins with ins[c. and appending a ].
[ ] Add something to the general recommendations page on the use of square brackets. It should have information that these are also necessary for insertions of a different molecular type. Currently, it says:

[ ] (square brackets) are used for alleles (see DNA, RNA, protein), which includes multiple inserted sequences at one position, and insertions from a second reference sequence.
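
The fix described in the to-do list (replace ins with ins[c. and append a ]) could be sketched as a small rewrite rule. The function name and the pattern are hypothetical; the pattern only covers the simple shape of the examples listed above (an r. insertion followed by a single intronic range) and is not a general HGVS parser.

```python
import re

# Hypothetical pattern: ':r.<pos>_<pos>ins' followed by one intronic range at
# the end of the description, e.g. ':r.186_187ins' + '186+1_186+4'.
OLD_STYLE_INSERTION = re.compile(r"(:r\.\d+_\d+ins)(\d+[+-]\d+_\d+[+-]\d+)$")


def fix_intronic_insertion(description: str) -> str:
    """Wrap an inserted intronic range as ins[c.<range>]; leave other text as-is."""
    return OLD_STYLE_INSERTION.sub(r"\1[c.\2]", description)
```

Descriptions that already use the bracketed form do not match the pattern and are returned unchanged.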

jfjlaros commented 9 months ago

[ ] (square brackets) are used for alleles (see DNA, RNA, protein), which includes multiple inserted sequences at one position, and insertions from a second reference sequence.

The last part of this sentence does not apply to the second example (third bullet). I will check whether leaving the brackets out leads to ambiguities in the nomenclature (I suspect not).

ifokkema commented 9 months ago

I assume the square brackets are added to increase readability and clarity in the existing cases, which would also apply to this new case. But if you want to have a vote for it in the group, that's fine by me; I had the idea that the square brackets were welcomed, but a poll could show otherwise.

jtdendunnen commented 9 months ago

As far as I know the rule is that when in a variant description we change to another reference sequence type this must be described within []. In the examples we go from r. > [c.]

jtdendunnen commented 9 months ago

I did not see this new page before replying to another page but I have serious problems with this proposal. I have given the format "NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4]" some more thought but think we should not use c. My reasoning:

- Everybody will understand the format r.186_187ins186+1_186+4; why unnecessarily complicate HGVS nomenclature?
- The format suggests the inserted sequence (c.186+1_186+4) is "GTAT", which is not correct; the inserted sequence is "guau".
- Indeed, the mature RNA sequence (mRNA) does not contain introns, but the original messenger RNA does. We could reason that the unspliced RNA is the reference sequence. The recommendations for the RNA reference sequence have: "nucleotide numbering for a RNA reference sequence follows that of the associated coding or non-coding DNA reference sequence; nucleotide r.123 relates to c.123 or n.123".

ifokkema commented 9 months ago

As far as I know the rule is that when in a variant description we change to another reference sequence type this must be described within []. In the examples we go from r. > [c.]

The definition is "a second reference sequence", which would be slightly unclear here, but I do agree that it would be better to use them in this case.

I did not see this new page before replying to another page but I have serious problems with this proposal. I have given the format "NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4]" some more thought but think we should not use c. My reasoning:

(snip)

To prevent parallel discussions, I've copied my comments to these points here, and linked to this issue from the discussions page.

Considering the first point: it's ambiguous and inconsistent.
- Ambiguous: would r.1_2ins100_200 include intronic sequences or not? Having this depend on whether an NM or NC(NM) projection is used would be dangerous, in my opinion.
- Consistency: it would be yet another exception. "r. variants have no introns, except...".
I don't think that the second interpretation is necessarily true. Sure, the NM reference sequence is a DNA reference sequence. But when we use the r. numbering, we all agree that we're no longer talking about the DNA sequence but about the RNA sequence. However, that sequence isn't actually included in the reference sequence. The same would hold for using NC(NM) projections, necessary for referring to intronic positions. So NC(NM):r.... already "transcribes" the DNA sequence in that reference sequence to RNA, assuming all T's become u's, etc. I consider it a good option to use the same "magic" for this variant description.
Third point: I'm afraid that's simply not possible. If the unspliced RNA is the reference sequence, we are unable to have any variant description on the RNA level showing an effect of splicing.
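
The implicit "transcription" mentioned in the second point can be sketched in a few lines. This is only an illustration with a hypothetical helper name; it assumes the inserted c. range is read from the DNA reference and then written in the lowercase RNA alphabet (a, c, g, u).

```python
def transcribe(dna: str) -> str:
    """Write a DNA sequence in r. notation: T becomes u, all letters lowercase."""
    return dna.upper().replace("T", "U").lower()
```

With this reading, the "GTAT" found at c.186+1_186+4 in the DNA reference is inserted as "guau" in the r. description.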

jfjlaros commented 9 months ago

To elaborate on the first point, what would r.186_187dup mean?

A duplication of two nucleotides, or
a duplication of an intron plus two nucleotides.
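
The two readings can be made concrete with a toy example. The sequences and positions below are invented for illustration and are not taken from NM_004006.2; toy positions r.4 and r.5 play the role of r.186 and r.187, with an intron between them.

```python
# Toy sequences, invented for illustration only.
EXON_A = "acgu"   # toy positions r.1 to r.4; r.4 is the last nucleotide of the exon
INTRON = "guaag"  # lies between r.4 and r.5, so it has no r. positions of its own
EXON_B = "ccau"   # toy positions r.5 to r.8


def duplicated_spliced() -> str:
    """Reading 1: the reference is the mature RNA, so r.4_5 spans two nucleotides."""
    mature = EXON_A + EXON_B
    return mature[3:5]


def duplicated_unspliced() -> str:
    """Reading 2: the reference is the unspliced RNA, so the intron is included."""
    return EXON_A[-1] + INTRON + EXON_B[0]
```

In this toy case the same description r.4_5dup would duplicate two nucleotides under the first reading and seven under the second, which is the ambiguity raised above.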

jtdendunnen commented 9 months ago

I am convinced, I go with the proposed format.

One remark, I think we should use NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4], not NM_004006.2:r.186_187ins[NC_000023.10(NM_004006.2):c.186+1_186+4].

jfjlaros commented 9 months ago

I think we should use NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4], not NM_004006.2:r.186_187ins[NC_000023.10(NM_004006.2):c.186+1_186+4].

I think it depends on which reference sequence was used to call the variant. If it was a genomic one (most likely), the first description is preferable indeed. I also want to remark that these descriptions are not equivalent.

marinadistefano commented 8 months ago

I also agree that the [ ] adds clarity and I also agree with Johan's order suggestion above.

jtdendunnen commented 7 months ago

Jeroen, I do not understand your remark "which reference sequence was used to call the variant". The variant on RNA level can only be called using a genomic reference sequence (intron sequences were inserted), so it must be NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4].

jfjlaros commented 7 months ago

Not that it is common practice, but sometimes the transcriptome is used as a reference. The insertion could have been rewritten retrospectively, like any other insertion of a foreign sequence. This is not very likely in this case, as the insertion is only four nucleotides, but in general this is possible.

jtdendunnen commented 7 months ago

To get things clear, I would like to take a step back and first discuss c. insertions, specifically the insertion of an intron-less copy of an RNA transcript. I assume the description would be like NC_000023.11:g.1000000_1000001ins[NM_004006.3:c.-244_*2691;A[22]] (note that such an insertion usually has an additional A-tail, since it is a copy of the mature RNA transcript). This description should mean that NO intron sequences are inserted. Next, I have an insertion at the same position, but now including intron sequences and without the A-tail. The description is then NC_000023.11:g.1000000_1000001ins31119228_33211556. Is it also possible to describe it using the format NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):c.-244_*2691]? Using "ins[NC_000023.11(NM_004006.2)]" should mean it includes the intron sequences, to make a difference compared to "ins[NM_004006.3]"?

jfjlaros commented 7 months ago

Since we can insert a c. range in an r. description, I would say that we can also do it the other way around. So an insertion of a transcript including introns could be described as

NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):c.-244_*2691]

and if we want to exclude introns, we can describe it as

NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):r.-244_*2691]

ifokkema commented 7 months ago

NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):r.-244_*2691]

When there are no sequence differences between the NM and the measured insertion, it would be simpler just to write NC_000023.11:g.1000000_1000001ins[NM_004006.2:r.-244_*2691].

jtdendunnen commented 3 days ago

One complication: r. nucleotides are by definition a, c, g, and u. I assume these cannot be inserted in a DNA sequence.

jfjlaros commented 3 days ago

One complication, [...]

We discussed this before here.

HGVSnomenclature / hgvs-nomenclature

Descriptions using the r. coordinate type should not have intronic positions #147
