biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0

FLT3 Large Insertion Generating Incorrect HGVS_P #482

Open akeeeshi opened 6 years ago

akeeeshi commented 6 years ago

I found a curious situation when trying to annotate a large insertion in FLT3. This insertion of roughly 100 base pairs seems to produce an hgvs_p signifying a delins affecting about 400 amino acid codons. I am at a loss for why this would happen. Any context or explanation would be greatly appreciated.

Official HGVS guidelines state that "in-frame insertions containing a translation stop codon in the inserted sequence are described as an insertion, not as a deletion-insertion removing the entire C-terminal amino acid sequence." Information about this can be found here.

Example code snippets are below.

var_c = hp.parse_hgvs_variant('NM_004119.2:c.1756_1757insGTGACCGGCTCCTCAGATAATGAGTACTTCTACGTTGATTTCAGAGAATATGAATATGATCTCAAATGGGAGTTTCCAAGAGAAAATTTAGAGTTTGAG')
vm.c_to_p(var_c)

Output:

Out[2]: SequenceVariant(ac=NP_004110.2, type=p, posedit=(Asp586_Ser993delinsGlyAspArgLeuLeuArg))

akeeeshi commented 6 years ago

Hello, I wanted to follow up and re-open this issue and see if any members of your group had thoughts on the annotation of this particular variant.

AngieHinrichs commented 6 years ago

The protein change that I get for that variant is NP_004110.2:p.Asp586delinsGlyAspArgLeuLeuArgTer. The first affected codon changes from Asp to Gly, so I believe it does need to be a delins, not just an ins.

I don't know why HGVS would recommend against indicating that the rest of the protein is lost -- that does seem more informative.

reece commented 6 years ago

hgvs 1.2.3 gives:

>>> var_c = hp.parse_hgvs_variant('NM_004119.2:c.1756_1757insGTGACCGGCTCCTCAGATAATGAGTACTTCTACGTTGATTTCAGAGAATATGAATATGATCTCAAATGGGAGTTTCCAAGAGAAAATTTAGAGTTTGAG')
>>> var_p = vm.c_to_p(var_c)
>>> str(var_p)
'NP_004110.2:p.(Ser585_Asp586insGlyAspArgLeuLeuArgTerTerValLeuLeuArgTerPheGlnArgIleTerIleTerSerGlnMetGlyValSerLysArgLysPheArgValTer)'

I suspect that the difference is due to recent fixes (#474, #492).

From the protein delins examples in the recommendations, this variant should be written as a position range. The most comparable example is p.(Pro578_Lys579delinsLeuTer).

@akeeeshi : Would you please check whether you agree with the above return value and report back here?

AngieHinrichs commented 6 years ago

> From the protein delins examples in the recommendations, this variant should be written as a position range. The most comparable example is p.(Pro578_Lys579delinsLeuTer).

Even when it's deleting (changing) a single codon? There's also this example:

reece commented 6 years ago

You're right, @AngieHinrichs.

Coming back to the original example, I don't see any guidance for whether this should be written as p.Ser585_Asp586ins… or p.Asp586delins. Both seem plausible to me on first glance. I think I need to draw out cases.

What's your opinion?

AngieHinrichs commented 6 years ago

Thanks Reece. It's not clear to me either what would be the most correct form. I think the complete range Asp586_Ser993 that c_to_p used to output is most informative, but the recommendations simply say to not do that. :) This is why we need something like SPDI in addition to HGVS.

As long as functional effect prediction tools interpret the variant as stop_gained (SO:0001587), hopefully the variant would get the attention it deserves in a genome-wide scan.

Looking around at how a few other online tools handle c-to-p translation:

VariantValidator: NP_004110.2:p.(Asp586_Ser993delinsGlyAspArgLeuLeuArg)

Mutalyzer: NM_004119.2(FLT3_i001):p.(Asp586_Ser993delinsGlyAspArgLeuLeuArg)

Ensembl Variant Recoder: NP_004110.2:p.Ser585_Asp586insGlyAspArgLeuLeuArgTerTerValLeuLeuArgTerPheGlnArgIleTerIleTerSerGlnMetGlyValSerLysArgLysPheArgValTer

So I see three distinct approaches in use among 6 tools (counting different versions of hgvs twice):

Mine is clearly the odd one out, but there's support for both of the first two. Of those two, I like the first better, because a point insertion before the first modified amino acid implies that that codon would still follow after the inserted sequence, but in the genomic/transcript sequence, the insertion begins mid-codon. I could live with either way though.

reece commented 6 years ago

I believe that the 585_586delins version is preferable.

This trivial case may shed light on the issue: Imagine a SNV that converts Asp at 586 to Ter. Clearly, that should be written as Asp586Ter and not Asp586_Ser993del. (NP_004110.2 is 993 AA long.)

The question is whether the protein consequence should be inferred by a left-to-right greedy mechanism or, rather, by (essentially) a global alignment.
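To make the two strategies concrete, here is a minimal sketch on toy one-letter sequences. It is not the algorithm that hgvs implements; `greedy_diff` and `trimmed_diff` are hypothetical helper names, and the toy protein MKSDQS is an assumption standing in for FLT3, with Ser at position 3 and Asp at position 4.

```python
def greedy_diff(ref, alt):
    """Left-to-right: trim only the shared prefix; the rest counts as changed."""
    i = 0
    while i < min(len(ref), len(alt)) and ref[i] == alt[i]:
        i += 1
    return i, ref[i:], alt[i:]


def trimmed_diff(ref, alt):
    """Alignment-like: trim the shared prefix, then the shared suffix of the rest."""
    i, ref_rest, alt_rest = greedy_diff(ref, alt)
    j = 0
    while (j < min(len(ref_rest), len(alt_rest))
           and ref_rest[-1 - j] == alt_rest[-1 - j]):
        j += 1
    return i, ref_rest[:len(ref_rest) - j], alt_rest[:len(alt_rest) - j]


REF = "MKSDQS"            # toy reference protein (assumption)
ALT_TRUNCATED = "MKSGDR"  # toy variant protein, cut at the first stop
ALT_FULL = "MKSGDR*VDQS"  # same variant translated through the stops
SNV_FULL = "MKS*QS"       # toy SNV turning Asp4 into a stop

# Greedy on the truncated product: Asp4 to the C-terminus looks replaced.
print(greedy_diff(REF, ALT_TRUNCATED))  # (3, 'DQS', 'GDR')
# Trimming both ends of the read-through translation: a pure insertion.
print(trimmed_diff(REF, ALT_FULL))      # (3, '', 'GDR*V')
# The SNV case collapses to a single substitution, like Asp586Ter.
print(trimmed_diff(REF, SNV_FULL))      # (3, 'D', '*')
```

The first result corresponds to the Asp586_Ser993delins style, the second to the Ser585_Asp586ins style, and the third to the SNV example above.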

What do you make of that argument?

P.S. Not sure if you're aware that VariantValidator is based on hgvs. So VV isn't an independent sample.

AngieHinrichs commented 6 years ago

> I believe that the 585_586delins version is preferable.

585_586delins or 585_586ins? I'm confused now - for delins I think it's either Asp586delins or Asp586_Ser993delins.

> This trivial case may shed light on the issue: Imagine a SNV that converts Asp at 586 to Ter. Clearly, that should be written as Asp586Ter and not Asp586_Ser993del. (NP_004110.2 is 993 AA long.)

Thanks for the example! Good point that in that case, we clearly don't make it a big range deletion.

Now if the variant were an insertion instead of an SNV but still changing the Asp to a Ter, would it still be Asp586Ter, or Ser585_Asp586insTer? I bet different tools would give different answers.

I still have a slight preference for delins whether the deleted range is a single base or the rest of the protein, because the genome/transcript variant hits mid-codon, changing a pre-existing amino acid. But if the wider consensus is to treat it as a point insertion at the amino acid level I could conform to that. It would be a pure insertion if the insertion were at the codon boundary.
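The mid-codon point can be checked with simple arithmetic. This is a minimal sketch assuming standard CDS numbering, in which c.1 is the A of the start codon so that codon n spans c.(3n-2) to c.3n; the helper names are hypothetical and not part of any library.

```python
def codon_of(c_pos):
    """Return (codon number, 1-based position within the codon) for a c. coordinate."""
    return (c_pos - 1) // 3 + 1, (c_pos - 1) % 3 + 1


def insertion_at_codon_boundary(left_flank):
    """True if an insertion right after c.<left_flank> falls between two codons."""
    return left_flank % 3 == 0


print(codon_of(1756))                     # (586, 1)
print(codon_of(1757))                     # (586, 2)
print(insertion_at_codon_boundary(1756))  # False: the insertion splits codon 586
print(insertion_at_codon_boundary(1755))  # True: between codons 585 and 586
```

Under that numbering, c.1756_1757ins lands after the first base of codon 586, which is consistent with Asp586 being read as Gly in the variant protein.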

So I guess we should ask the HGVS folks for a clarification? (i.e. a positive statement of what to do, including what position to use even if the inframe insertion hits mid-codon, instead of simply "don't remove the entire range")

> P.S. Not sure if you're aware that VariantValidator is based on hgvs. So VV isn't an independent sample.

Oops -- I forgot about that. :)

Peter-J-Freeman commented 6 years ago

Hi Angie,

Asp586Ter would be correct. Ser585_Asp586insTer does not make sense. If you think about this biologically, once you terminate translation following Ser585, the codon for Asp586 would not be translated, thus Asp586 would never exist. It’s also likely that the nonsense-mediated decay pathway would be activated.

I do not want to take your bet that software may get this wrong though!!!

Cheers

Pete


akeeeshi commented 6 years ago

Hi all,

I apologize for the delayed response. From my perspective hgvs 1.2.3 giving

NP_004110.2:p.(Ser585_Asp586insGlyAspArgLeuLeuArgTerTerValLeuLeuArgTerPheGlnArgIleTerIleTerSerGlnMetGlyValSerLysArgLysPheArgValTer)

as an output might be the least correct way of representing this variant, echoing what @PeteCausey-Freeman stated above, mainly because it continues to list inserted amino acids after the first termination codon. In that case, I would almost say that the original behavior of

NP_004110.2:p.(Asp586_Ser993delinsGlyAspArgLeuLeuArg)

would be preferred. I think in this case the HGVS rules might be taking the wrong approach. I am reaching out to them promptly. Please let me know if you have additional thoughts or if I can help out in any additional way.

reece commented 5 years ago

This question was raised on the HGVS Facebook page (sigh).

Facebook is a closed system (you have to have an account to view the page). Here's the text:

Q: when I have variant LRG_457t1:c.1756_1757insGATCTGGGGATAGACTCCTTCGGTAATGAGTGCTT what is the correct #HGVS description for the predicted consequence at protein level; p.(Asp586_Ser993delinsGlySerGlyAspArgLeuLeuArg), p.(Ser585_Asp586insGlySerGlyAspArgLeuLeuArgTerTerValLeu) or p.(Asp586Glyfs*9)?

A: according to #HGVS insertions containing a translation stop codon in the inserted sequence are described as an insertion, not as a deletion-insertion removing the entire C-terminal amino acid sequence. Therefore the correct description is p.(Ser585_Asp586insGlySerGlyAspArgLeuLeuArgTer). Note that amino acids encoded after the first termination codon are not listed.
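The rule in that answer can be sketched in a few lines of plain Python. This is an illustration only, not the hgvs implementation: the helper names are hypothetical, and the codon prefix "G" is an assumption based on the insertion falling after the first base of the Asp586 codon (both Asp codons, GAT and GAC, start with G). The flanking residues Ser585 and Asp586 are taken from the thread.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Ter",
}


def inserted_residues(ins_seq, codon_prefix=""):
    """Translate codon_prefix + ins_seq and stop after the first Ter."""
    seq = (codon_prefix + ins_seq).upper()
    residues = []
    for i in range(0, len(seq) - 2, 3):
        aa = THREE_LETTER[CODON_TABLE[seq[i:i + 3]]]
        residues.append(aa)
        if aa == "Ter":
            break  # residues after the first stop are not listed
    return residues


def describe_insertion(left, right, residues):
    """Format a predicted protein-level insertion between two flanking residues."""
    return "p.({}_{}ins{})".format(left, right, "".join(residues))


# Insertion from the question quoted above.
FB_INS = "GATCTGGGGATAGACTCCTTCGGTAATGAGTGCTT"
print(describe_insertion("Ser585", "Asp586", inserted_residues(FB_INS, "G")))
# p.(Ser585_Asp586insGlySerGlyAspArgLeuLeuArgTer)

# Insertion from the start of this issue.
ISSUE_INS = ("GTGACCGGCTCCTCAGATAATGAGTACTTCTACGTTGATTTCAGAGAATATGAATATGATCTC"
             "AAATGGGAGTTTCCAAGAGAAAATTTAGAGTTTGAG")
print(describe_insertion("Ser585", "Asp586", inserted_residues(ISSUE_INS, "G")))
# p.(Ser585_Asp586insGlyAspArgLeuLeuArgTer)
```

Applied to the variant that opened this issue, the same rule would give an insertion that ends at the first Ter rather than listing all 33 inserted codons.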

akeeeshi commented 5 years ago

I was just about to repost on this thread after the holidays. I actually posed this question to the HGVS group over email two months ago. They finally responded two days ago! Looks like they posted their response on FB as well.

@reece would you consider it a viable improvement to make insertions and deletions that generate early termination follow the HGVS recommendation within the package?

reece commented 5 years ago

Yes, we should implement what HGVS recommends here. That is, we should revert to the original behavior, but add Ter.

In rereading this thread, I realized that I was so focused on coordinates and ins v. delins that I overlooked the internal Ter in the insert. ☹️

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 11 months ago

This issue was closed because it has been stalled for 7 days with no activity.

reece commented 9 months ago

This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.