Open ifokkema opened 9 months ago
[ ] (square brackets) are used for alleles (see DNA, RNA, protein), which includes multiple inserted sequences at one position, and insertions from a second reference sequence.
The last part of this sentence does not apply to the second example (third bullet). I will check whether leaving the brackets out leads to ambiguities in the nomenclature (I suspect not).
I assume the square brackets are added to increase readability and clarity in the existing cases, which would also apply to this new case. But if you want to have a vote for it in the group, that's fine by me; I had the idea that the square brackets were welcomed, but a poll could show otherwise.
As far as I know the rule is that when in a variant description we change to another reference sequence type this must be described within []. In the examples we go from r. > [c.]
I did not see this new page before replying to another page but I have serious problems with this proposal. I have given the format "NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4]" some more thought but think we should not use c. My reasoning:
> As far as I know the rule is that when in a variant description we change to another reference sequence type this must be described within []. In the examples we go from r. > [c.]
The definition is "a second reference sequence", which would be slightly unclear here, but I do agree that it would be better to use them in this case.
> I did not see this new page before replying to another page but I have serious problems with this proposal. I have given the format "NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4]" some more thought but think we should not use c. My reasoning: (snip)
To prevent parallel discussions, I've copied my comments to these points here, and linked to this issue from the discussions page.
- [...] r.1_2ins100_200 include intronic sequences or not? To have this depend on whether an NM or NC(NM) projection is used would be dangerous, in my opinion.
- [...] "r. variants have no introns, except...".
- [...] r. numbering, we all agree that we're no longer talking about the DNA sequence but about the RNA sequence. However, that sequence isn't actually included in the reference sequence. The same would hold for using NC(NM) projections, necessary for referring to intronic positions. So NC(NM):r.... already "transcribes" the DNA sequence in that reference sequence to RNA, assuming all T's become u's, etc. I consider it a good option to use the same "magic" for this variant description.
To elaborate on the first point, what would r.186_187dup mean?
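To make the "magic" in the third point concrete, here is a minimal Python sketch of what an NC(NM):r. description implicitly does with the DNA sequence. The function name `transcribe` is hypothetical and not part of any HGVS tool; the sketch only assumes the convention mentioned above (RNA written in lowercase, with T becoming u).

```python
def transcribe(dna: str) -> str:
    """Rewrite a DNA string in RNA notation: lowercase, with t replaced by u.

    Hypothetical helper for illustration only; it does no splicing and knows
    nothing about reference sequences.
    """
    rna = dna.lower().replace("t", "u")
    # Assumption: only the four RNA nucleotides plus n (unknown) are accepted.
    unexpected = set(rna) - set("acgun")
    if unexpected:
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    return rna


print(transcribe("GTAAGT"))  # guaagu
```

The same conversion would apply to an intronic range referenced with c. inside an r. description: the inserted DNA bases are read off the genomic reference and reported as RNA.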
I am convinced, I go with the proposed format.
One remark, I think we should use NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4], not NM_004006.2:r.186_187ins[NC_000023.10(NM_004006.2):c.186+1_186+4].
> I think we should use NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4], not NM_004006.2:r.186_187ins[NC_000023.10(NM_004006.2):c.186+1_186+4].
I think it depends on which reference sequence was used to call the variant. If it was a genomic one (most likely), the first description is preferable indeed. I also want to remark that these descriptions are not equivalent.
I also agree that the [ ] adds clarity and I also agree with Johan's order suggestion above.
Jeroen, I do not understand your remark "which reference sequence was used to call the variant". The variant on RNA level can only be called using a genomic reference sequence (intron sequences were inserted), so it must be NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4].
Not that it is common practice, but sometimes the transcriptome is used as a reference. The insertion could have been rewritten retrospectively, like any other insertion of a foreign sequence. This is not very likely in this case, as the insertion is only four nucleotides, but in general this is possible.
To get things clear, I would like to take a step back and first discuss c. insertions, specifically the insertion of an intron-less copy of an RNA transcript. I assume the description would be like NC_000023.11:g.1000000_1000001ins[NM_004006.3:c.-244_*2691;A[22]] (note that such an insertion usually has an additional A-tail, since it is a copy of the mature RNA transcript). This description should mean that NO intron sequences are inserted. Next, I have an insertion at the same position, but now including intron sequences and without the A-tail. The description is then NC_000023.11:g.1000000_1000001ins31119228_33211556. Would it also be possible to describe it using the format NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):c.-244_*2691]? Using "ins[NC_000023.11(NM_004006.2)]" should then mean that the intron sequences are included, to distinguish it from "ins[NM_004006.3]"?
Since we can insert a c. range in an r. description, I would say that we can also do it the other way around. So an insertion of a transcript including introns could be described as
NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):c.-244_*2691]
and if we want to exclude introns, we can describe it as
NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):r.-244_*2691]
> NC_000023.11:g.1000000_1000001ins[NC_000023.11(NM_004006.2):r.-244_*2691]
When there are no sequence differences between the NM and the measured insertion, it would be simpler just to write NC_000023.11:g.1000000_1000001ins[NM_004006.2:r.-244_*2691].
One complication: r. nucleotides are by definition a, c, g, and u. I assume these cannot be inserted into a DNA sequence.
Based on this discussion: https://github.com/HGVSnomenclature/hgvs-nomenclature/discussions/49.
As the r. coordinate type refers to mature (coding) RNA molecules, descriptions using the r. coordinate type should not have intronic positions. Therefore, variant descriptions like NC_000023.10(NM_004006.2):r.186_187ins186+1_186+4 (taken from the numbering page) are actually invalid.
During a HVNC meeting, we agreed that:
- [...] NM_004006.2:r.186_187ins[NC_000023.10(NM_004006.2):c.186+1_186+4] or perhaps NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4].
- [...] NM_004006.2:r.94_264delins[NC_000008.10:g.16369356_16369419inv].
To do:
- [...] NC_000023.11(NM_004006.2):r.649_650ins650-50_650-1 (RNA/splicing)
- NC_000023.11(NM_004006.2):r.831_832ins831+1_831+67 (RNA/splicing)
- NC_000023.11(NM_004006.2):r.649_650ins650-1400_650-1268 (RNA/splicing)
- LRG_199t1:r.186_187ins186+1_186+4, NC_000023.10(NM_004006.2):r.186_187ins186+1_186+4, NG_012232.1(NM_004006.2):r.186_187ins186+1_186+4 (numbering)
- NC_000023.10(NM_004006.2):r.357_358ins357+1_357+12, NG_012232.1(NM_004006.2):r.357_358ins357+1_357+12 (refseq)
- [...] ins with ins[c. and appending a ].
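The textual fix in the last to-do item (replace ins with ins[c. and append a ]) can be sketched as a small Python rewrite. This is only an illustration under stated assumptions: the name `bracket_intronic_insertion` and the regular expression are hypothetical, and the pattern only handles the simple form seen in the examples above, where the inserted range is written as position+offset_position+offset (or with -) at the end of an r. insertion.

```python
import re

# Assumption: the description ends with "ins" followed by an intronic range
# such as 186+1_186+4 or 650-50_650-1. Other insertion forms are left alone.
INTRONIC_INS = re.compile(r"(:r\.\d+_\d+ins)(\d+[+-]\d+_\d+[+-]\d+)$")


def bracket_intronic_insertion(description: str) -> str:
    """Wrap an intronic inserted range of an r. insertion in [c. ... ].

    Descriptions that do not match the assumed pattern are returned unchanged.
    """
    return INTRONIC_INS.sub(r"\1[c.\2]", description)


old = "NC_000023.10(NM_004006.2):r.186_187ins186+1_186+4"
print(bracket_intronic_insertion(old))
# NC_000023.10(NM_004006.2):r.186_187ins[c.186+1_186+4]
```

Already bracketed descriptions and insertions without intronic offsets (for example r.1_2ins100_200) do not match and pass through untouched. The sketch applies the textual rule only; it does not check that the reference sequence actually contains the intronic positions.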