Closed chaopower-zhang closed 1 month ago
This is on the reverse strand, so the duplication should be closest to the 5' end. Therefore, it should be c.3510+4dup?
I think there is a mistake in your attempt to normalize the genomic coordinates. Due to the HGVS 3' rule, genomic coordinates will get right-aligned by default, which moves the variant closer to the intron/exon boundary. In this case, since this is a reverse strand gene, this results in the +2 variant. If you wanted to do the correct shuffling here, you need to change the shuffling direction in the normalizer and 5' shuffle.
Perhaps the different shuffling directions are more clear if you view this from a transcript centric perspective with the forward strand chromosome orientation reversed. On reverse strand transcripts, you need 5' shuffled coordinates on the forward facing strand to be consistent with 3'-shuffled (right-shuffled) coordinates in transcript space.
hgvs_g : NC_000004.11:g.55955030_55955031insA
hgvs_c : NM_002253.3:c.3510+4_3510+5insT
hgvs_p : NP_002244.1:p.?
: 55,955,050 55,955,030
chrom pos : | . | . | . | .
seq <- 3': CGTTCGATTACGAGTCGTCCAAACAGTGGAGGTAGGTTCTTCG 5'
seq -> 5': GCAAGCTAATGCTCAGCAGGTTTGTCACCTCCATCCAAGAAGC 3'
region : |-|
tx seq -> : GCAAGCTAATGCTCAGCAG
tx pos : . | . |
: 3500 3510 3510+5 3510+12 3510+20
aa seq -> : uGlnAlaAsnAlaGlnGln
aa pos : ... |||
: 1170
ref>alt : A[3]>A[4]
here how to 5' shuffle:
In [40]: left_normalizer = Normalizer(hgvs_data_provider, shuffle_direction=5)
In [41]: hgvs_g = 'NC_000004.11:g.55955030_55955031insA'
In [42]: var_g = parse(hgvs_g)
In [43]: norm_var_g_5 = left_normalizer.normalize(var_g)
In [44]: print(norm_var_g_5)
NC_000004.11:g.55955031dup
In [45]: var_c = am37.g_to_c(norm_var_g_5, tx_ac)
In [46]: print(var_c)
NM_002253.2:c.3510+4dup
Note: we can't normalize in transcript space here, since the reference sequence is the transcript. It does not have intronic sequence, and as such we can not say if the insertion makes this a duplication. (There's also a risk that a different reference assembly has a different intronic sequence on the chromosome).
In conclusion I believe the library works correctly and we should close this issue.
I think there is a mistake in your attempt to normalize the genomic coordinates. Due to the HGVS 3' rule, genomic coordinates will get right-aligned by default, which moves the variant closer to the intron/exon boundary. In this case, since this is a reverse strand gene, this results in the +2 variant. If you wanted to do the correct shuffling here, you need to change the shuffling direction in the normalizer and 5' shuffle.
Perhaps the different shuffling directions are more clear if you view this from a transcript centric perspective with the forward strand chromosome orientation reversed. On reverse strand transcripts, you need 5' shuffled coordinates on the forward facing strand to be consistent with 3'-shuffled (right-shuffled) coordinates in transcript space.
hgvs_g : NC_000004.11:g.55955030_55955031insA hgvs_c : NM_002253.3:c.3510+4_3510+5insT hgvs_p : NP_002244.1:p.? : 55,955,050 55,955,030 chrom pos : | . | . | . | . seq <- 3': CGTTCGATTACGAGTCGTCCAAACAGTGGAGGTAGGTTCTTCG 5' seq -> 5': GCAAGCTAATGCTCAGCAGGTTTGTCACCTCCATCCAAGAAGC 3' region : |-| tx seq -> : GCAAGCTAATGCTCAGCAG tx pos : . | . | : 3500 3510 3510+5 3510+12 3510+20 aa seq -> : uGlnAlaAsnAlaGlnGln aa pos : ... ||| : 1170 ref>alt : A[3]>A[4]
here how to 5' shuffle:
In [40]: left_normalizer = Normalizer(hgvs_data_provider, shuffle_direction=5) In [41]: hgvs_g = 'NC_000004.11:g.55955030_55955031insA' In [42]: var_g = parse(hgvs_g) In [43]: norm_var_g_5 = left_normalizer.normalize(var_g) In [44]: print(norm_var_g_5) NC_000004.11:g.55955031dup In [45]: var_c = am37.g_to_c(norm_var_g_5, tx_ac) In [46]: print(var_c) NM_002253.2:c.3510+4dup
Note: we can't normalize in transcript space here, since the reference sequence is the transcript. It does not have intronic sequence, and as such we can not say if the insertion makes this a duplication. (There's also a risk that a different reference assembly has a different intronic sequence on the chromosome).
In conclusion I believe the library works correctly and we should close this issue.
Thank you for the explanation! I understand how the HGVS 3' rule affects right alignment and the shuffling direction for reverse strand genes. Viewing this from a transcript-centric perspective is very helpful. I'll definitely consider changing the shuffling direction in the normalizer to ensure consistency in the results. Also, could it automatically select the 3' end during normalization based on the gene orientation?
Thanks again for your insights!
In general you can normalize in transcript space. This won't work in your example though, since we don't have the intronic sequence as part of the transcript, and as such there is nothing to normalize with for intronic variants.
If you normalize in genomic space to make up for this, how do you select in what direction to normalize? If you have two genes on different strands close by, which one should you pick? As a user of the library you need to make this choice and pick the direction in which you want to normalize.
Also another thought: Although I use the hgvs library a lot, I actually rarely use "normalize". For example: in the variant that you provided I find it strange to call an event that is ultimately an A[3]>A[4]
a duplication of one A. Fully justified representation of variation is more explanatory compared to the 3' rule in my opinion. Also, making this variant a duplication instead of an insertion, changes the underlying coordinates. As you might tell, I am no fan of the "ins become dups" prioritization rule in HGVS.
no defect found.
Hi,
I encountered an issue with the
hgvs
Python library where the normalized HGVS result seems to differ from that of Mutalyzer, and also from what I observed in IGV.Variant details:
NC_000004.11:g.55955030_55955031insA
NC_000004.11(NM_002253.3):c.3510+4dup
hgvs
:NC_000004.11(NM_002253.3):c.3510+2dup
I verified the insertion position using IGV, which aligns with the result from Mutalyzer (c.3510+4dup). However, the
hgvs
library returnsc.3510+2dup
. I am using the UTA database with the latest version.Steps to reproduce:
Parse and normalize the following variant:
SequenceVariant(ac=NM_002253.2, type=c, posedit=3510+2dup, gene=None)