HGVSnomenclature / hgvs-nomenclature

HGVS Nomenclature website
https://hgvs-nomenclature.org/
MIT License

Descriptions using the r. coordinate type should not have intronic positions #147

Open ifokkema opened 9 months ago

ifokkema commented 9 months ago

Based on this discussion: https://github.com/HGVSnomenclature/hgvs-nomenclature/discussions/49.

As the r. coordinate type refers to mature (coding) RNA molecules, descriptions using the r. coordinate type should not have intronic positions. Therefore, variant descriptions like NC_000023.10(NM_004006.2):r.186_187ins186+1_186+4 (taken from the numbering page) are actually invalid.
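The rule stated above can be sketched as a simple check. This is only an illustration, not part of the HGVS recommendations or of any existing tool: the function name and both regular expressions are assumptions, and the patterns cover only the description shapes that appear in this issue.

```python
import re

# Assumption: an intronic position is a base position followed by an offset,
# e.g. 186+1 or 187-2 (optionally with a leading "-" or "*").
_INTRONIC = re.compile(r"[-*]?\d+[+-]\d+")

# Assumption: a bracketed part that switches to another coordinate type,
# e.g. [c.186+1_186+4], is not part of the r. coordinates themselves.
_OTHER_TYPE = re.compile(r"\[[^\[\]]*\b[cgnmo]\.[^\[\]]*\]")


def r_part_has_intronic_position(description: str) -> bool:
    """Return True if the r. part of a description uses an intronic offset."""
    # Blank out bracketed parts that use a different coordinate type.
    stripped = _OTHER_TYPE.sub("[]", description)
    marker = stripped.find(":r.")
    if marker == -1:
        raise ValueError("expected a description of the form <reference>:r.<...>")
    return _INTRONIC.search(stripped[marker + 3:]) is not None


old_style = "NC_000023.10(NM_004006.2):r.186_187ins186+1_186+4"
new_style = "NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4]"

print(r_part_has_intronic_position(old_style))  # intronic offsets directly under r.
print(r_part_has_intronic_position(new_style))  # offsets only inside the [c. ...] part
```

Under these assumptions the first description is flagged and the second is not, because its intronic positions are given in the bracketed c. part.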

During an HVNC meeting, we agreed that:

To do:

jfjlaros commented 9 months ago

[ ] (square brackets) are used for alleles (see DNA, RNA, protein), which includes multiple inserted sequences at one position, and insertions from a second reference sequence.

The last part of this sentence does not apply to the second example (third bullet). I will check whether leaving the brackets out leads to ambiguities in the nomenclature (I suspect not).

ifokkema commented 9 months ago

I assume the square brackets are added to increase readability and clarity in the existing cases, which would also apply to this new case. But if you want to have a vote for it in the group, that's fine by me; I had the idea that the square brackets were welcomed, but a poll could show otherwise.

jtdendunnen commented 9 months ago

As far as I know the rule is that when in a variant description we change to another reference sequence type this must be described within []. In the examples we go from r. > [c.]

jtdendunnen commented 9 months ago

I did not see this new page before replying to another page but I have serious problems with this proposal. I have given the format "NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4]" some more thought but think we should not use c. My reasoning:

  1. everybody will understand the format r.186_187ins186+1_186+4, so why unnecessarily complicate HGVS nomenclature?
  2. the format suggests the inserted sequence (c.186+1_186+4) is "GTAT", which is not correct; the inserted sequence is "guau".
  3. indeed the mature RNA sequence (mRNA) does not contain introns but the original messenger RNA does. We could reason the unspliced RNA is the reference sequence. The recommendations for the RNA reference sequence have: "nucleotide numbering for a RNA reference sequencing follows that of the associated coding or non-coding DNA reference sequence; nucleotide r.123 relates to c.123 or n.123".
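
The second point above concerns the difference between the DNA letters implied by a c. range and the RNA letters used in r. descriptions. A minimal sketch of that conversion, assuming uppercase A/C/G/T for DNA and lowercase a/c/g/u for RNA (the function names are hypothetical and only plain unambiguous nucleotides are handled):

```python
def dna_to_rna(dna: str) -> str:
    """Rewrite a DNA sequence (A, C, G, T) in RNA letters (a, c, g, u)."""
    if set(dna) - set("ACGT"):
        raise ValueError("expected only A, C, G and T")
    return dna.lower().replace("t", "u")


def rna_to_dna(rna: str) -> str:
    """Rewrite an RNA sequence (a, c, g, u) in DNA letters (A, C, G, T)."""
    if set(rna) - set("acgu"):
        raise ValueError("expected only a, c, g and u")
    return rna.replace("u", "t").upper()


print(dna_to_rna("GTAT"))  # guau
print(rna_to_dna("guau"))  # GTAT
```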

ifokkema commented 9 months ago

As far as I know the rule is that when in a variant description we change to another reference sequence type this must be described within []. In the examples we go from r. > [c.]

The definition is "a second reference sequence", which would be slightly unclear here, but I do agree that it would be better to use them in this case.

I did not see this new page before replying to another page but I have serious problems with this proposal. I have given the format "NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4]" some more thought but think we should not use c. My reasoning:

(snip)

To prevent parallel discussions, I've copied my comments to these points here, and linked to this issue from the discussions page.

jfjlaros commented 9 months ago

To elaborate on the first point, what would r.186_187dup mean?

jtdendunnen commented 9 months ago

I am convinced, I go with the proposed format.

One remark, I think we should use NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4], not NM_004006.2:r.186_187ins[NC_000023.10(NM_004006.2):c.186+1_186+4].

jfjlaros commented 9 months ago

I think we should use NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4], not NM_004006.2:r.186_187ins[NC_000023.10(NM_004006.2):c.186+1_186+4].

I think it depends on which reference sequence was used to call the variant. If it was a genomic one (most likely), the first description is preferable indeed. I also want to remark that these descriptions are not equivalent.
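To make the difference between the two descriptions concrete, here is a rough sketch that splits each one into its outer reference and the reference of the inserted part. It assumes that a bracketed part without its own reference inherits the outer one; that inheritance rule, the function name, and the regular expression are assumptions for illustration and only cover the `ins[...]` shape used in this thread.

```python
import re

# Assumed shape: <reference>:<type>.<positions>ins[<inserted part>]
_INSERTION = re.compile(
    r"^(?P<ref>[^:]+):(?P<ctype>[a-z])\.(?P<pos>[^\[]+)ins\[(?P<inner>.+)\]$"
)


def resolve_insertion(description: str) -> dict:
    """Split an ins[...] description into outer and inserted reference parts."""
    match = _INSERTION.match(description)
    if match is None:
        raise ValueError("not an ins[...] description of the assumed shape")
    inner = match["inner"]
    if ":" in inner:
        # The inserted part names its own reference sequence.
        insert_ref, insert_desc = inner.split(":", 1)
    else:
        # Assumption: no reference given, so the outer reference applies.
        insert_ref, insert_desc = match["ref"], inner
    return {
        "outer_reference": match["ref"],
        "outer_type": match["ctype"],
        "positions": match["pos"],
        "insert_reference": insert_ref,
        "insert": insert_desc,
    }


first = resolve_insertion("NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4]")
second = resolve_insertion(
    "NM_004006.2:r.186_187ins[NC_000023.10(NM_004006.2):c.186+1_186+4]"
)
print(first["outer_reference"], first["insert_reference"])
print(second["outer_reference"], second["insert_reference"])
```

In this sketch both descriptions take the inserted sequence from the same reference, but the position r.186_187 is interpreted against different outer references.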

marinadistefano commented 8 months ago

I also agree that the [ ] adds clarity and I also agree with Johan's order suggestion above.

jtdendunnen commented 7 months ago

Jeroen, I do not understand your remark "which reference sequence was used to call the variant". The variant on RNA level can only be called using a genomic reference sequence (intron sequences were inserted), so it must be NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4].

jfjlaros commented 7 months ago

Not that it is common practice, but sometimes the transcriptome is used as a reference. The insertion could have been rewritten retrospectively, like any other insertion of a foreign sequence. This is not very likely in this case, as the insertion is only four nucleotides, but in general this is possible.

jtdendunnen commented 7 months ago

To get things clear, I would like to take a step back and first discuss c. insertions, specifically the insertion of an intron-less copy of an RNA transcript. I assume the description would be like NC_000023.11:g.1000000_1000001ins[NM_004006.3:c.-244_*2691;A[22]] (note that such an insertion usually has an additional A-tail, since it is a copy of the mature RNA transcript). This description should mean that NO intron sequences are inserted. Next, I have an insertion at the same position, but now including intron sequences and without the A-tail. The description is then NC_000023.11:g.1000000_1000001ins31119228_33211556. Is it also possible to describe it using the format NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):c.-244_*2691]? Using "ins[NC_000023.11(NM_004006.2)]" should mean it includes the intron sequences, to distinguish it from "ins[NM_004006.3]"?

jfjlaros commented 7 months ago

Since we can insert a c. range in an r. description, I would say that we can also do it the other way around. So an insertion of a transcript including introns could be described as

NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):c.-244_*2691]

and if we want to exclude introns, we can describe it as

NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):r.-244_*2691]
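
The distinction drawn above (a c. range on a genomic reference spans the introns, an r. range covers the spliced transcript only) can be illustrated with toy data. Everything in this sketch is hypothetical: the sequence, the exon coordinates (0-based, end-exclusive), and the function names are made up for illustration, and both results are written in DNA letters.

```python
def transcript_span_with_introns(genomic: str, exons: list) -> str:
    """Sequence from the first exon start to the last exon end, introns included."""
    first_start = exons[0][0]
    last_end = exons[-1][1]
    return genomic[first_start:last_end]


def spliced_transcript(genomic: str, exons: list) -> str:
    """Sequence of the exons only, joined in order (introns removed)."""
    return "".join(genomic[start:end] for start, end in exons)


# Toy reference: two exons (AAA and CCC) separated by a two-base intron (GT).
toy_genomic = "TTAAAGTCCCTT"
toy_exons = [(2, 5), (7, 10)]

print(transcript_span_with_introns(toy_genomic, toy_exons))  # AAAGTCCC
print(spliced_transcript(toy_genomic, toy_exons))            # AAACCC
```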

ifokkema commented 7 months ago

NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):r.-244_*2691]

When there are no sequence differences between the NM and the measured insertion, it would be simpler just to write NC_000023.11:g.1000000_1000001ins[NM_004006.2:r.-244_*2691].

jtdendunnen commented 3 days ago

One complication: r. nucleotides are by definition a, c, g, and u. I assume these cannot be inserted in a DNA sequence.

jfjlaros commented 3 days ago

One complication, [...]

We discussed this before here.