biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0
233 stars 94 forks source link

Coordinate is out of bounds error when roundtripping GRCh38 variants via `g_to_c` + `c_to_g` and `relevant_transcripts` #717

Closed mihaitodor closed 5 months ago

mihaitodor commented 6 months ago

Describe the bug

I'm parsing a g. variant, then I'm fetching some relevant transcripts for it via the GRCh38 AssemblyMapper, then I'm projecting it to any of the NM relevant transcript and then I'm projecting that back to a g. variant. Naively, what I think should happen is that it should be able to round-trip and print the input variant, but right now it's just throwing this error. Note that it does work as expected if you use, for example variant = 'NC_000011.10:g.8263343T>C'. Also, it works if I use GRCh37.

To Reproduce

Steps to reproduce the behavior:

import hgvs.assemblymapper
import hgvs.dataproviders.uta
import hgvs.parser

database_schema = 'uta_20210129b'
data_provider = hgvs.dataproviders.uta.connect(
    db_url=f"postgresql://anonymous:anonymous@uta.biocommons.org/uta/{database_schema}")
b38_assembly_mapper = hgvs.assemblymapper.AssemblyMapper(
    data_provider, assembly_name='GRCh38', alt_aln_method='splign')
parser = hgvs.parser.Parser()

variant = 'NC_000001.10:g.145592073A>T'

parsed_variant = parser.parse_hgvs_variant(variant)

transcripts = b38_assembly_mapper.relevant_transcripts(parsed_variant)
print(transcripts)

relevant_transcript = 'NM_006468.7'
# relevant_transcript = 'NM_001303456.1'

var_c = b38_assembly_mapper.g_to_c(parsed_variant, relevant_transcript)

b38_projected_variant = b38_assembly_mapper.c_to_g(var_c)

print(b38_projected_variant)

Running the above code fails with the following error:

Traceback (most recent call last):
  File "/Users/mihaitodor/Projects/genomex/spdi/./test.py", line 25, in <module>
    b38_projected_variant = b38_assembly_mapper.c_to_g(var_c)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/hgvs/assemblymapper.py", line 115, in c_to_g
    var_out = super(AssemblyMapper, self).c_to_g(var_c, alt_ac, alt_aln_method=self.alt_aln_method)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/hgvs/variantmapper.py", line 283, in c_to_g
    pos_g = mapper.c_to_g(var_c.posedit.pos)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/hgvs/alignmentmapper.py", line 277, in c_to_g
    return self.n_to_g(self.c_to_n(c_interval), strict_bounds=strict_bounds)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/hgvs/alignmentmapper.py", line 263, in c_to_n
    n_start = pos_c_to_n(c_interval.start)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/hgvs/alignmentmapper.py", line 260, in pos_c_to_n
    raise HGVSInvalidIntervalError(f"c.{pos} coordinate is out of bounds")
hgvs.exceptions.HGVSInvalidIntervalError: c.*578 coordinate is out of bounds

Expected behavior

The code should print NC_000001.10:g.145592073A>T at the end.

Additional context

@reece mentioned on the biocommons Slack that this looks like a bug.

andreasprlic commented 5 months ago

Looking into this, it turns out that the hgvs_c string here is NM_006468.7:c.*578=, so the UTR region of this transcript (in which that variant lies) does have a different sequence than the reference genome. We call that a "transcript-ref-disagree" position.

Somehow in assembly 38 this position seems to be outside of transcript coding space, therefore there is a coordinate is out of bounds error message. To work around here, you could set hgvs.global_config.mapping.strict_bounds = False. Doing so results in NC_000001.11:g.145842814_145842815=

Looking into the underlying alignment in UTA, The last exon (that is fully coding for the UTR) has alignment issues in both assemblies. in assembly 37 this transcript has one exon more but the alignment is 183=1X377=3I1027=. Assembly 38 is one exon shorter and the second to last has an alignment of 476=1588I. I believe that's why that specific position comes out as "out of bounds".

I would conclude from that that neither assembly is a good match for the terminal UTR region here, and we should prob not assume that we can map variants in that region across assemblies.

I don't think there's a hgvs bug here, this is just the nature of the reference assemblies...

mihaitodor commented 5 months ago

Thank you for looking into this @andreasprlic and for providing such a detailed explanation!

Given this issue, do you think it would make sense to bubble up strict_bounds as an optional parameter for c_to_g in AssemblyMapper and VariantMapper? That way, users could disable this validation selectively. In my case, I'm OK to disable it globally and that seems to work fine.

gostachowiak commented 5 months ago

This issue was auto-closed due to inactivity, but I think it's related https://github.com/biocommons/hgvs/issues/608

andreasprlic commented 5 months ago

@mihaitodor if your goal is identifying if the location of a variant in another genome assembly, I am concerned about trying to do that in regions where the assemblies have changed a lot. Just because you can compute some coordinates does not mean these are biologically "the same". I really think in the example you provided above, the lifted over coordinates should not be trusted. As such I would not encourage you to map across assemblies when strict_bounds=False are necessary. I also would recommend to think about a QC approach to make sure that any lifted over variant can be considered to be equivalent in the context of both assemblies, otherwise treat them as distinct variants.

mihaitodor commented 5 months ago

Thank you @andreasprlic! Indeed, in such cases it's probably better to error out. Initially, we had some code based on pyliftover which I guess would have similar issues? I opened #711 after @reece mentioned that it would be handy to have some direct liftover support in hgvs.

The code I'm trying to add some features to is just a reference implementation for now, but indeed, we'll have to be stricter in production-ready implementations.