biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0
236 stars 94 forks source link

Index Error for variant using am37.c_to_g() #606

Open sptaylor opened 4 years ago

sptaylor commented 4 years ago

We encountered a bug when running c_to_g via assembly mapper for variant 'NM_001291722.1:c.283-3C>T':

$ hgvs-shell
Python 3.6.9 (default, Jan  8 2020, 08:00:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

############################################################################
hgvs-shell -- interactive hgvs
hgvs version: 1.5.1
data provider url: postgresql://anonymous:anonymous@uta.biocommons.org/uta/uta_20180821
schema_version: 1.1
data_version: uta_20180821
sequences source: SeqRepo (/Users/paige.taylor/Data/seqrepo/2018-08-21)

The following variables are defined:
* global_config
* hp, parser, hgvs_parser -- Parser instance
* hdp, hgvs_data_provider -- UTA data provider instance
* vm, variant_mapper, hgvs_variant_mapper -- VariantMapper instance
* am37, hgvs_assembly_mapper_37 -- GRCh37 Assembly Mapper instance
* am38, projector, hgvs_assembly_mapper_38 -- GRCh38 Assembly Mapper instances
* hn, normalizer, hgvs_normalizer -- Normalizer instance
* hv, validator, hgvs_validator) -- Validator instance

The following functions are available:
  * parse, normalize, validate
  * g_to_c, g_to_n, g_to_t,
  * c_to_g, c_to_n, c_to_p,
  * n_to_c, n_to_g,
  * t_to_g,
  * get_relevant_transcripts

When submitting bug reports, include the version header shown above
and use these variables/variable names whenever possible.

In [1]: var = hp.parse_hgvs_variant('NM_001291722.1:c.283-3C>T')

In [2]: am37.c_to_g(var)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~/virtualenv/hgvs_v15/lib/python3.6/site-packages/hgvs/shell.py in <module>
----> 1 am37.c_to_g(var)

~/virtualenv/hgvs_v15/lib/python3.6/site-packages/hgvs/assemblymapper.py in c_to_g(self, var_c)
    114         alt_ac = self._alt_ac_for_tx_ac(var_c.ac)
    115         var_out = super(AssemblyMapper, self).c_to_g(
--> 116             var_c, alt_ac, alt_aln_method=self.alt_aln_method)
    117         return self._maybe_normalize(var_out)
    118

~/virtualenv/hgvs_v15/lib/python3.6/site-packages/hgvs/variantmapper.py in c_to_g(self, var_c, alt_ac, alt_aln_method)
    301             pos_n = mapper.g_to_n(pos_g)
    302             edit_g = hgvs.edit.NARefAlt(
--> 303                 ref='', alt=self._get_altered_sequence(mapper.strand, pos_n, var_n))
    304         pos_g.uncertain = var_c.posedit.pos.uncertain
    305         var_g = hgvs.sequencevariant.SequenceVariant(

~/virtualenv/hgvs_v15/lib/python3.6/site-packages/hgvs/variantmapper.py in _get_altered_sequence(self, strand, interval, var)
    524
    525         if edit.type == 'sub':
--> 526             seq[pos_start] = edit.alt
    527         elif edit.type == 'del':
    528             del seq[pos_start:pos_end]

IndexError: list assignment index out of range
reece commented 4 years ago

@sptaylor: I can confirm this bug.

I accidentally discovered that this bug doesn't happen if I use a uta_20150827 (discovered through a sloppy copy-paste from old notes). That discovery only increases my curiosity.

I don't have funding, or time to donate, to work on this right now.

cassiemk commented 3 years ago

@reece we are experiencing the same error in a different context. when using the Assembly Mapper's g_to_c function with strict_bounds=True, this variant NC_000001.10:g.16890441C>G throws a HGVSInvalidIntervalError due to gaps in the sequence. when using the same function with strict_bounds=False we get

'Traceback (most recent call last): {some lines ommitted for brevity} File "/usr/local/lib/python3.6/site-packages/hgvs/assemblymapper.py", line 100, in g_to_c var_g, tx_ac, alt_aln_method=self.alt_aln_method) File "/usr/local/lib/python3.6/site-packages/hgvs/variantmapper.py", line 259, in g_to_c ref='', alt=self._get_altered_sequence(mapper.strand, pos_g, var_g)) File "/usr/local/lib/python3.6/site-packages/hgvs/variantmapper.py", line 526, in _get_altered_sequence seq[pos_start] = edit.alt IndexError: list assignment index out of range'

We propose a try/except in the else statement code block at line 225 of variantmapper.py (also below, hgvs1.5.1) to raise the HGVSInvalidInteralError. Does this make sense or is there a deeper issue?

'if not pos_c.uncertain: edit_c = self._convert_edit_check_strand(mapper.strand, var_g.posedit.edit) if edit_c.type == 'ins' and pos_c.start.offset == 0 and pos_c.end.offset == 0 and pos_c.end - pos_c.start > 1: pos_c.start.base += 1 pos_c.end.base -= 1 edit_c.ref = '' else:

variant at alignment gap

        pos_g = mapper.c_to_g(pos_c)
        edit_c = hgvs.edit.NARefAlt(
            ref='', alt=self._get_altered_sequence(mapper.strand, pos_g, var_g))'
davmlaw commented 11 months ago

Validation problem?

Other tools seem to think this is an error, so perhaps it's a validation problem:

ClinGen allele registry: dies with "intronic position inside exon" Variant Validator: ExonBoundaryError: Position c.283-3 does not correspond with an exon boundary for transcript NM_001291722.1

Exon positions in UTA

Where is "c.283" ?

In [46]: var_c = hp.parse_hgvs_variant('NM_001291722.1:c.283C>T') # remove the 3
In [47]: am37.c_to_n(var_c)
Out[47]: SequenceVariant(ac=NM_001291722.1, type=n, posedit=422C>T, gene=None)

Double check - NM_001291722.1 CDS start = 139 139 + 283 = transcript pos 422 so c_to_n is correct here

GRCh37 Tx exons around 422 are:

['CYFIP2', 'NM_001291722.1', 'NC_000005.9', 'splign', 1, 3, 346, 424, 156721791, 156721866, '75=3D', None, None, 738621, 366974, 6225535, 3655233, 3833217]
['CYFIP2', 'NM_001291722.1', 'NC_000005.9', 'splign', 1, 4, 424, 526, 156723677, 156723782, '3I102=', None, None, 738621, 366974, 6225536, 3655234, 3833382]

It looks like perhaps UTA is calling the exon wrong here? Why have a deletion of 3 bases at exon/intron boundary rather than shift the exon boundary?


Other investigation

I started to look into this and noticed something...

var_c = hp.parse_hgvs_variant('NM_001291722.1:c.283-3C>T')
mapper = am37._fetch_AlignmentMapper("NM_001291722.1", "NC_000005.9", "splign")

pos_c = var_c.posedit.pos
pos_c_to_n = mapper.c_to_n(pos_c)
pos_g = mapper.c_to_g(pos_c)
pos_c_to_g_to_n = mapper.g_to_n(pos_g)

print(pos_c_to_n)
print(pos_c_to_g_to_n)

These should be the same? Output:

422-3
418_419

Not sure if related, but I've found a few other examples of where conversion seems inconsistent (if you convert from c to g then back to c it's changed)

# These examples are from ./tests/data/clinvar.gz so presumably real HGVS from the wild
# Yes, they are probably wrong, eg "NM_000180.3:c.3233-3236dup" is probably a typo, dash is supposed to be underscore
example_hgvs = [
    'NM_001145026.1:c.715A>G',
    'NM_004543.4:c.7432-2025_7536+372del2502',
    'NM_000180.3:c.3233-3236dup',
    'NM_032383.3:c.1303+1G>A',
    'NM_001011.3:c.148+1G>A',
    'NM_000158.3:c.2052+5289_2052+5297delGTGTGGTGGinsTGTTTTTTACATGACAGGT',
    'NM_003611.2:c.2122-2125dupAAGA',
    'NM_001017975.4:c.1687-1G>C',
    'NM_006904.6:c.1776+1523dupA',
    'NM_001122679.1:c.2968-2A>T'
]

for hgvs_str in example_hgvs:
    var_c = hp.parse_hgvs_variant(hgvs_str)
    var_g = am37.c_to_g(var_c)
    var_c2 = am37.g_to_c(var_g, var_c.ac)  # This should in theory be the same as we started
    print(var_c)
    print(var_c2)
    print("-" * 20)

Between the dashed lines, the two should be consistent

NM_001145026.1:c.715A>G
NM_001145026.1:c.661-2_661-1insGAGAATAGTGAATCTTTTTTATGGAGTACAGCCAGCCCTTCTCCAACCCTTGGTGGAGTTACACCTCCATCGCGTACCACACATTCATCAAGCACGTTGACACAGAATGAGATCAGCTCTGTGTGGAAAGAGCCTATCAGTTTTGTAGTGACACACTTGAG
--------------------
NM_004543.4:c.7432-2025_7536+372del
NM_004543.4:c.7431+1917_7536+372del
--------------------
NM_000180.3:c.3233-3236dup
NM_000180.3:c.2226dup
--------------------
NM_032383.3:c.1303+1G>A
NM_032383.3:c.1304T>A
--------------------
NM_001011.3:c.148+1G>A
NM_001011.3:c.149=
--------------------
NM_000158.3:c.2052+5289_2052+5297delinsTGTTTTTTACATGACAGGT
NM_000158.3:c.2053-3358_2053-3350delinsTGTTTTTTACATGACAGGT
--------------------
NM_003611.2:c.2122-2125dup
NM_003611.2:c.1654+9dup
--------------------
NM_001017975.4:c.1687-1G>C
NM_001017975.4:c.1686G>C
--------------------
NM_006904.6:c.1776+1523dup
NM_006904.6:c.1777-723dup
--------------------
NM_001122679.1:c.2968-2A>T
NM_001122679.1:c.2966G>T

Some of these are def invalid, though in that case, they should throw an error?

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 8 months ago

This issue was closed because it has been stalled for 7 days with no activity.

reece commented 6 months ago

This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.