Issue: Inconsistent normalization results compared to Mutalyzer

chaopower-zhang commented 1 month ago

Hi,

I encountered an issue with the hgvs Python library where the normalized HGVS result seems to differ from that of Mutalyzer, and also from what I observed in IGV.

Variant details:

Input: NC_000004.11:g.55955030_55955031insA
Expected result (from Mutalyzer): NC_000004.11(NM_002253.3):c.3510+4dup
Result from hgvs: NC_000004.11(NM_002253.3):c.3510+2dup

I verified the insertion position using IGV, which aligns with the result from Mutalyzer (c.3510+4dup). However, the hgvs library returns c.3510+2dup. I am using the UTA database with the latest version.

Steps to reproduce:

Parse and normalize the following variant:

os.environ["UTA_DB_URL"] = "postgresql://anonymous@192.168.38.90/uta/uta_20210129b"
os.environ["HGVS_SEQREPO_DIR"] = "/soft/hgvs/seqrepo_lib/2024-02-20/"
hgvs_g = 'NC_000004.11:g.55955030_55955031insA'
tx_num = 'NM_002253.2'
var_c = am37.g_to_c(var, tx_num)
normalized_var_g = am37.c_to_g(var_c)
normalized_var_c = am37.g_to_c(normalized_var_g, 'NM_002253.2')
print(normalized_var_c )

SequenceVariant(ac=NM_002253.2, type=c, posedit=3510+2dup, gene=None)

chaopower-zhang commented 1 month ago

This is on the reverse strand, so the duplication should be closest to the 5' end. Therefore, it should be c.3510+4dup？

andreasprlic commented 1 month ago

I think there is a mistake in your attempt to normalize the genomic coordinates. Due to the HGVS 3' rule, genomic coordinates will get right-aligned by default, which moves the variant closer to the intron/exon boundary. In this case, since this is a reverse strand gene, this results in the +2 variant. If you wanted to do the correct shuffling here, you need to change the shuffling direction in the normalizer and 5' shuffle.

Perhaps the different shuffling directions are more clear if you view this from a transcript centric perspective with the forward strand chromosome orientation reversed. On reverse strand transcripts, you need 5' shuffled coordinates on the forward facing strand to be consistent with 3'-shuffled (right-shuffled) coordinates in transcript space.

hgvs_g    : NC_000004.11:g.55955030_55955031insA
hgvs_c    : NM_002253.3:c.3510+4_3510+5insT
hgvs_p    : NP_002244.1:p.?
          :    55,955,050          55,955,030
chrom pos :    |    .    |    .    |    .    |    .    
seq  <- 3': CGTTCGATTACGAGTCGTCCAAACAGTGGAGGTAGGTTCTTCG 5'
seq  -> 5': GCAAGCTAATGCTCAGCAGGTTTGTCACCTCCATCCAAGAAGC 3'
region    :                     |-|                    
tx seq -> : GCAAGCTAATGCTCAGCAG                        
tx pos    :    .    |    .    |                        
          :         3500      3510 3510+5 3510+12 3510+20
aa seq -> : uGlnAlaAsnAlaGlnGln                        
aa pos    :  ...            |||                        
          :                 1170                         
ref>alt   : A[3]>A[4]

here how to 5' shuffle:

In [40]: left_normalizer = Normalizer(hgvs_data_provider, shuffle_direction=5)
In [41]: hgvs_g = 'NC_000004.11:g.55955030_55955031insA'
In [42]: var_g = parse(hgvs_g)
In [43]: norm_var_g_5 = left_normalizer.normalize(var_g)
In [44]: print(norm_var_g_5)
NC_000004.11:g.55955031dup

In [45]: var_c = am37.g_to_c(norm_var_g_5, tx_ac)
In [46]: print(var_c)
NM_002253.2:c.3510+4dup

Note: we can't normalize in transcript space here, since the reference sequence is the transcript. It does not have intronic sequence, and as such we can not say if the insertion makes this a duplication. (There's also a risk that a different reference assembly has a different intronic sequence on the chromosome).

In conclusion I believe the library works correctly and we should close this issue.

chaopower-zhang commented 1 month ago

I think there is a mistake in your attempt to normalize the genomic coordinates. Due to the HGVS 3' rule, genomic coordinates will get right-aligned by default, which moves the variant closer to the intron/exon boundary. In this case, since this is a reverse strand gene, this results in the +2 variant. If you wanted to do the correct shuffling here, you need to change the shuffling direction in the normalizer and 5' shuffle.

Perhaps the different shuffling directions are more clear if you view this from a transcript centric perspective with the forward strand chromosome orientation reversed. On reverse strand transcripts, you need 5' shuffled coordinates on the forward facing strand to be consistent with 3'-shuffled (right-shuffled) coordinates in transcript space.
hgvs_g    : NC_000004.11:g.55955030_55955031insA
hgvs_c    : NM_002253.3:c.3510+4_3510+5insT
hgvs_p    : NP_002244.1:p.?
          :    55,955,050          55,955,030
chrom pos :    |    .    |    .    |    .    |    .    
seq  <- 3': CGTTCGATTACGAGTCGTCCAAACAGTGGAGGTAGGTTCTTCG 5'
seq  -> 5': GCAAGCTAATGCTCAGCAGGTTTGTCACCTCCATCCAAGAAGC 3'
region    :                     |-|                    
tx seq -> : GCAAGCTAATGCTCAGCAG                        
tx pos    :    .    |    .    |                        
          :         3500      3510 3510+5 3510+12 3510+20
aa seq -> : uGlnAlaAsnAlaGlnGln                        
aa pos    :  ...            |||                        
          :                 1170                         
ref>alt   : A[3]>A[4]
here how to 5' shuffle:
In [40]: left_normalizer = Normalizer(hgvs_data_provider, shuffle_direction=5)
In [41]: hgvs_g = 'NC_000004.11:g.55955030_55955031insA'
In [42]: var_g = parse(hgvs_g)
In [43]: norm_var_g_5 = left_normalizer.normalize(var_g)
In [44]: print(norm_var_g_5)
NC_000004.11:g.55955031dup

In [45]: var_c = am37.g_to_c(norm_var_g_5, tx_ac)
In [46]: print(var_c)
NM_002253.2:c.3510+4dup
Note: we can't normalize in transcript space here, since the reference sequence is the transcript. It does not have intronic sequence, and as such we can not say if the insertion makes this a duplication. (There's also a risk that a different reference assembly has a different intronic sequence on the chromosome).

In conclusion I believe the library works correctly and we should close this issue.

Thank you for the explanation! I understand how the HGVS 3' rule affects right alignment and the shuffling direction for reverse strand genes. Viewing this from a transcript-centric perspective is very helpful. I'll definitely consider changing the shuffling direction in the normalizer to ensure consistency in the results. Also, could it automatically select the 3' end during normalization based on the gene orientation?

Thanks again for your insights!

andreasprlic commented 1 month ago

In general you can normalize in transcript space. This won't work in your example though, since we don't have the intronic sequence as part of the transcript, and as such there is nothing to normalize with for intronic variants.

If you normalize in genomic space to make up for this, how do you select in what direction to normalize? If you have two genes on different strands close by, which one should you pick? As a user of the library you need to make this choice and pick the direction in which you want to normalize.

Also another thought: Although I use the hgvs library a lot, I actually rarely use "normalize". For example: in the variant that you provided I find it strange to call an event that is ultimately an A[3]>A[4] a duplication of one A. Fully justified representation of variation is more explanatory compared to the 3' rule in my opinion. Also, making this variant a duplication instead of an insertion, changes the underlying coordinates. As you might tell, I am no fan of the "ins become dups" prioritization rule in HGVS.

andreasprlic commented 1 month ago

no defect found.

biocommons / hgvs

Issue: Inconsistent normalization results compared to Mutalyzer #749

Variant details:

Steps to reproduce: