biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0

c_to_p for dups that start in transcript but end in UTR #715

Closed b0d0nne11 closed 5 months ago

b0d0nne11 commented 10 months ago

Any duplications with an end position at or past the stop codon should be classified as 3'UTR regardless of start position. Currently mapping NM_153223.3:c.2959_*1dup yields NP_694955.2:p.(Met1?). We believe that should map to NP_694955.2:p.? because all other variants in the UTR map to p.?.

In [1]: var_c = parse('NM_153223.3:c.2959_*1dup')

In [2]: var_p = c_to_p(var_c)

In [3]: var_p
Out[3]: SequenceVariant(ac=NP_694955.2, type=p, posedit=Met1?, gene=None)

In [4]: str(var_p)
Out[4]: 'NP_694955.2:p.Met1?'

We expect this to result in NP_694955.2:p.? instead.

b0d0nne11 commented 10 months ago

After discussing this internally, we think the same applies to insertions.

In [1]: var_c = parse('NM_004985.4:c.567_*1insCCC')

In [2]: var_p = c_to_p(var_c)

In [3]: str(var_p)
Out[3]: 'NP_004976.2:p.(Ter189Ter)'

We expect this to also return p.?. I'll extend my PR to handle these cases.

reece commented 8 months ago

I agree that the current responses for both examples are wrong. However, what it should be is less clear to me.

Can you please elaborate on your rationale for p.? in these cases?

gostachowiak commented 8 months ago

@reece For the mutations affected by this pull request, the entire coding sequence is unchanged and the added material is within the 3' UTR.

c.39_*1insA

c.12_*1dup

Therefore, these are 3' UTR mutations. All other 3' UTR mutations get p.?, so these mutations should also get p.?
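The rule argued for here can be sketched in a few lines. This is only an illustration of the reasoning, not the code in the pull request and not part of the hgvs API: the helper names and the regular expression are hypothetical, only simple `start_end` ranges are handled, and the CDS lengths (39 and 12) are inferred from the fact that those positions sit directly next to `*1` in the two examples above.

```python
import re

# Hypothetical pattern: matches only simple ranges such as "c.12_*1dup"
# or "c.39_*1insA" (no intronic offsets, no 5' UTR positions).
_RANGE_EDIT = re.compile(r"c\.(\*?\d+)_(\*?\d+)(dup|ins[ACGT]+)$")


def _at_or_past_stop(pos: str, cds_length: int) -> bool:
    """True if pos is the last CDS base or a 3' UTR position like '*1'."""
    return pos.startswith("*") or int(pos) == cds_length


def added_material_after_stop(c_posedit: str, cds_length: int) -> bool:
    """Return True when the new bases land after the stop codon.

    A dup places the copied bases right after its end position; an ins
    places the new bases right after its start position.
    """
    match = _RANGE_EDIT.search(c_posedit)
    if match is None:
        raise ValueError(f"unsupported variant for this sketch: {c_posedit}")
    start, end, edit = match.groups()
    anchor = end if edit == "dup" else start
    return _at_or_past_stop(anchor, cds_length)


# The two examples from this comment, with assumed CDS lengths.
print(added_material_after_stop("c.39_*1insA", cds_length=39))
print(added_material_after_stop("c.12_*1dup", cds_length=12))
```

Under this reading, a variant for which the function returns `True` leaves every codon up to and including the stop codon untouched, which is the argument for reporting it as p.? like other 3' UTR variants.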

andreasprlic commented 7 months ago

What is your source for the variant representation of NM_153223.3:c.2959_*1dup? Did you call g_to_c previously?

If we try to represent the underlying genomic event that causes this variant and use the left-shuffled insertion representation, I believe we end up with NC_000005.10:g.123346517_123346518insATTA. Performing g_to_c on this representation results in NM_153223.3:c.*1_*2insTAAT, and c_to_p then yields p.?. So this issue is also related to the ins->dup rule in the hgvs conventions.
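To see why the dup and the insertion describe the same change, here is a small sketch on a made-up transcript. The sequence is a toy, not NM_153223.3; the only properties borrowed from the thread are that the stop codon is TAA and the first 3' UTR base is T, which is what the `insTAAT` form implies. The helper names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def coding_codons(seq: str) -> list[str]:
    """Codons read from the first base up to and including the first stop."""
    codons = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons


def apply_dup(seq: str, start: int, end: int) -> str:
    """Duplicate the 1-based inclusive range start..end right after end."""
    return seq[:end] + seq[start - 1:end] + seq[end:]


def apply_ins(seq: str, after: int, inserted: str) -> str:
    """Insert bases between 1-based positions after and after + 1."""
    return seq[:after] + inserted + seq[after:]


cds = "ATGGCCTAA"   # toy CDS ending in the stop codon TAA
utr3 = "TGGC"       # toy 3' UTR whose first base (*1) is T
ref = cds + utr3
cds_end = len(cds)  # last CDS base; *1 is position cds_end + 1

# Analogue of c.2959_*1dup: stop codon plus the first UTR base.
dup_seq = apply_dup(ref, cds_end - 2, cds_end + 1)
# Analogue of c.*1_*2insTAAT: insertion between *1 and *2.
ins_seq = apply_ins(ref, cds_end + 1, "TAAT")

print(dup_seq == ins_seq)
print(coding_codons(dup_seq) == coding_codons(ref))
```

In this toy case both descriptions produce the same mutated sequence, and the codons up to the first stop are identical to the reference.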

To be honest, personally I am not a big fan of this hgvs-dup "prioritization" rule. In my opinion this modifies the underlying nature of the genomic event and drastically changes the coordinates. We would be often better off without the representation as dup (for most small variants). Your variant is one of the examples why.

Btw, if I plug in right-shuffled coordinates for this variant I end up with p.(=). I am not sure which of the two hgvs_p is "better".

gostachowiak commented 7 months ago

@andreasprlic We are just attempting to follow the guidelines as they exist, which say that if you can represent something as a dup, it must be represented as a dup, and that nomenclature should be 3' shifted. The cdot nomenclature NM_153223.3:c.2959_*1dup is correct HGVS nomenclature according to those rules, and the pull request fixes a bug where the pdot is assigned incorrectly.

The key point for the examples in the pull request is that the inserted material is inserted AFTER the stop codon, in the sense that the ribosome will make it all the way to the stop codon and not encounter any mutation. Therefore, in the pull request these variants are identified as being in the 3' UTR region, and then end up with p.? like any other 3' UTR variant.

To answer your initial question, the cdot NM_153223.3:c.2959_*1dup comes from calling g_to_c on NC_000005.9:g.122682212_122682215dup, which is itself the left-shifted version of the correct gdot (NC_000005.9:g.122682216_122682219dup), because the transcript is negative strand.
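The strand effect can be illustrated with a toy plus-strand sequence. Everything here is made up for illustration (the sequence, the coordinates, and the helper names); the only assumption taken from the thread is that two adjacent 4-base windows can both be written as a dup of the same event, which requires the two windows to be identical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement, i.e. the sequence as read on the minus strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_dup(seq: str, start: int, end: int) -> str:
    """Duplicate the 1-based inclusive range start..end right after end."""
    return seq[:end] + seq[start - 1:end] + seq[end:]


def to_minus_strand(start: int, end: int, length: int) -> tuple[int, int]:
    """Map a 1-based plus-strand range to minus-strand coordinates."""
    return length - end + 1, length - start + 1


# Toy plus-strand genome: two tandem copies of ATTA with flanking bases.
genome = "GGC" + "ATTA" + "ATTA" + "GGC"
left, right = (4, 7), (8, 11)

# Duplicating either copy gives the same genomic sequence.
print(apply_dup(genome, *left) == apply_dup(genome, *right))

# On the minus strand the left genomic window is the more 3' one.
print(to_minus_strand(*left, len(genome)))
print(to_minus_strand(*right, len(genome)))
print(revcomp(genome))
```

In this sketch the left-most genomic window maps to the higher transcript coordinates, so for a minus-strand transcript the 3'-shifted cdot corresponds to the left-shifted gdot.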

andreasprlic commented 6 months ago

@reece I feel this example demonstrates a problem with the hgvs recommendation to represent insertions as duplications where appropriate. The dup changes the underlying nature (coordinates) of the event, and as a consequence we have problems with the hgvs_p here. I believe you are involved in some of the discussions about the future of hgvs. Is the ins->dup recommendation something that could get more nuance? Perhaps insertions don't need to be changed to duplications at the chromosomal level, and this should only be recommended at the protein level?

gostachowiak commented 6 months ago

@andreasprlic @reece I think we probably all agree that returning p.Met1? is completely wrong for NM_153223.3:c.2959_*1dup.

This pull request returns p.? instead, which is the same thing returned for NM_153223.3:c.*1_*2insTAAT, which is what the cdot would be if the hgvs guidelines were changed to eliminate dups.

Based on that, can this pull request be merged, and future changes to hgvs guidelines be dealt with separately?

gostachowiak commented 6 months ago

@andreasprlic @reece by the way, the pull request also fixes non-duplication insertions just after the stop codon. The second unit test added is NM_004985.4:c.567_*1insCCC --> p.?

andreasprlic commented 5 months ago

@gostachowiak apologies for the slow response. Yes, p.Met1? is definitely completely wrong. You refer to "this pull request" - do you mean #716? I took a look at the latest version of that and it looks much more concise now! As such, I approved.

gostachowiak commented 5 months ago

@andreasprlic yes I meant #716. Thanks!