biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0
239 stars 94 forks source link

In-frame del at the end of a transcript throws Index Error #529

Open sptaylor opened 5 years ago

sptaylor commented 5 years ago

c_to_p fails for variant "NM_198180.2:c.408_410delGTG". This is in-frame deletion that involves the termination codon, so I believe it is similar to issues #476, #449 and #525 image

To reproduce in hgvs-shell:

In [1]: v = hp.parse_hgvs_variant("NM_198180.2:c.408_410delGTG")

In [2]: am37.c_to_p(v)
INFO:biocommons.seqrepo.fastadir.fastadir:Opening for reading: /usr/local/share/seqrepo/2017-10-26/sequences/2016/0824/050145/1472014905.920233.fa.bgz
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/usr/local/lib/python2.7/site-packages/hgvs/shell.pyc in <module>()
----> 1 am37.c_to_p(v)

/usr/local/lib/python2.7/site-packages/hgvs/assemblymapper.pyc in c_to_p(self, var_c)
    115
    116     def c_to_p(self, var_c):
--> 117         var_out = super(AssemblyMapper, self).c_to_p(var_c)
    118         return self._maybe_normalize(var_out)
    119

/usr/local/lib/python2.7/site-packages/hgvs/variantmapper.pyc in c_to_p(self, var_c, pro_ac)
    357         for alt_data in all_alt_data:
    358             builder = altseq_to_hgvsp.AltSeqToHgvsp(reference_data, alt_data)
--> 359             var_p = builder.build_hgvsp()
    360             var_ps.append(var_p)
    361

/usr/local/lib/python2.7/site-packages/hgvs/utils/altseq_to_hgvsp.pyc in build_hgvsp(self)
     93                     start = self._alt_data.variant_start_aa - 1
     94                     delta = len(self._alt_seq) - len(self._ref_seq)
---> 95                     while self._ref_seq[start] == self._alt_seq[start]:
     96                         start += 1
     97                     offset = start + abs(delta)

IndexError: string index out of range
reece commented 5 years ago

Confirmed on master (hgvs 1.2.5). :-( I agree that this is likely related to the issues you saw previously.

sptaylor commented 5 years ago

I'm going to take a stab at this. I might have a followup question for you.

Peter-J-Freeman commented 5 years ago

Hi Reece and Paige,

I have been taking a look at this error. Sorry if you have moved on, but it’s particularly interesting.

Other than giving a warning, I’m not sure what you will be able to do with the c_to_p translation for this variant. The final few bases of the reference sequence look like this

WT GAAGAAAGGCGGCTTCAGCTTCCGCTTCGGTCGGCGGTGA ALT GAAGAAAGGCGGCTTCAGCTTCCGCTTCGGTCGGCG---A

There is no 3 prime UTR nor is there a specific poly_A signal annotated in the record.

The index error is being caused due to there being an insufficient number of bases to complete a codon i.e. the final coon of the variant sequence would be GA- (where – is an absent base). Therefore, I don’t think the error is related to #476https://github.com/biocommons/hgvs/issues/476, #449https://github.com/biocommons/hgvs/issues/449 and #525https://github.com/biocommons/hgvs/issues/525.

I’m also not sure what the p. prediction would be. Generally, we lost the termination codon, we would in most instances expect to encounter a termination codon in the 3 prime UTR e.g. NM_000517.4:c.427T>C would map to NP_000508.1:p.(Ter143GlnextTer31). However, because we cannot create a final codon due to a short reference sequence, this format will not work. Perhaps p.(Ter?) could work, but I think p.? would be the safest outcome because at face value based on the transcript reference sequence, we would likely predict Non_STOP_DECAY and no protein.

What do you think?

Cheers

Pete

From: Paige notifications@github.com Reply-To: biocommons/hgvs reply@reply.github.com Date: Monday, 22 October 2018 at 19:41 To: biocommons/hgvs hgvs@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [biocommons/hgvs] In-frame del at the end of a transcript throws Index Error (#529)

c_to_p fails for variant "NM_198180.2:c.408_410delGTG". This is in-frame deletion that involves the termination codon, so I believe it is similar to issues #476https://github.com/biocommons/hgvs/issues/476, #449https://github.com/biocommons/hgvs/issues/449 and #525https://github.com/biocommons/hgvs/issues/525 [Image removed by sender. image]https://user-images.githubusercontent.com/909335/47311635-276aa280-d5ef-11e8-9686-dd537e187112.png

To reproduce in hgvs-shell:

In [1]: v = hp.parse_hgvs_variant("NM_198180.2:c.408_410delGTG")

In [2]: am37.c_to_p(v)

INFO:biocommons.seqrepo.fastadir.fastadir:Opening for reading: /usr/local/share/seqrepo/2017-10-26/sequences/2016/0824/050145/1472014905.920233.fa.bgz


IndexError Traceback (most recent call last)

/usr/local/lib/python2.7/site-packages/hgvs/shell.pyc in ()

----> 1 am37.c_to_p(v)

/usr/local/lib/python2.7/site-packages/hgvs/assemblymapper.pyc in c_to_p(self, var_c)

115

116     def c_to_p(self, var_c):

--> 117 var_out = super(AssemblyMapper, self).c_to_p(var_c)

118         return self._maybe_normalize(var_out)

119

/usr/local/lib/python2.7/site-packages/hgvs/variantmapper.pyc in c_to_p(self, var_c, pro_ac)

357         for alt_data in all_alt_data:

358             builder = altseq_to_hgvsp.AltSeqToHgvsp(reference_data, alt_data)

--> 359 var_p = builder.build_hgvsp()

360             var_ps.append(var_p)

361

/usr/local/lib/python2.7/site-packages/hgvs/utils/altseq_to_hgvsp.pyc in build_hgvsp(self)

 93                     start = self._alt_data.variant_start_aa - 1

 94                     delta = len(self._alt_seq) - len(self._ref_seq)

---> 95 while self._ref_seq[start] == self._alt_seq[start]:

 96                         start += 1

 97                     offset = start + abs(delta)

IndexError: string index out of range

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/biocommons/hgvs/issues/529, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AZVoub4O5s4eKJsDHKmRcHnFptn-K8jMks5unhFJgaJpZM4Xz8JK.

sptaylor commented 5 years ago

Hey @PeteCausey-Freeman

Thank you for weighing in and providing that detailed explanation, I agree. I originally thought it would be related to those other issues affecting the Ter but after digging in, I realized it was a bit trickier. The absence of a 3'UTR explains it. It's also probably why the protein naming has stumped other tools: an old version of Alamut predicted the protein effect as p.Arg136_137ins34, the new version has it as p.137ext?, and Mutalyzer goes with: p.(=) https://mutalyzer.nl/name-checker?description=NM_198180.2%3Ac.408_410delGTG.

I am tagging @naomifox since she was taking a stab at it. I agree that p.? seems to be the safest way to go.

Peter-J-Freeman commented 5 years ago

Great.

Happy to help.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 8 months ago

This issue was closed because it has been stalled for 7 days with no activity.

reece commented 7 months ago

This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.