biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0

`g_to_t` for promoter regions #724

Closed by markgene 4 months ago

markgene commented 4 months ago

Thank you for developing the package. I really like it. However, I get frustrated when trying to convert variants in promoter regions. For example, given a TERT promoter mutation, `g_to_t` raises HGVSInvalidIntervalError, as shown in the code below, run in hgvs-shell (v1.5.4):

g = hp.parse_hgvs_variant('NC_000005.9:g.1295228G>A')
tx_acs = am37.relevant_transcripts(g)
tx_acs

# []

hgvs_t = am37.g_to_t(g, 'NM_198253.2')

# HGVSInvalidIntervalError: Position is beyond the bounds of transcript record

Is there an existing solution I missed? If not, would it be possible to provide one?

It would be very valuable, as a substantial number of variants fall into promoters or other regulatory regions, not only in TERT. For example, an early paper reported that 1.5% of the mutations in HGMD are regulatory:

this database contains a total of 73 411 registered mutations (assembly date September 2007), of which 1.5% are regulatory.

-Mark

davmlaw commented 4 months ago

Current HGVS recommendations state:

5' and 3' flanking sequences are considered to be outside the boundaries of a transcript reference sequence and can not be used to describe variants

I know there are a lot of invalid HGVS strings out there in the wild. I can see that you might want to be lax and allow parsing them, but we probably shouldn't be generating them and adding to the number of invalid HGVS strings in the world. You should just use g. coordinates here.
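As a rough illustration of "just use g. coordinates", here is a minimal sketch of a wrapper that attempts `g_to_t` and keeps the genomic variant when the position lies outside the transcript. The helper name `g_to_t_or_keep_g` is hypothetical and not part of hgvs. It assumes `mapper` is an object with a `g_to_t(var_g, tx_ac)` method, such as `am37` in hgvs-shell. The stand-in exception class exists only so the sketch still runs where hgvs is not installed.

```python
try:
    # Real exception class raised in the traceback above.
    from hgvs.exceptions import HGVSInvalidIntervalError
except ImportError:
    # Assumption: hgvs is not installed; define a stand-in for illustration only.
    class HGVSInvalidIntervalError(Exception):
        pass


def g_to_t_or_keep_g(mapper, var_g, tx_ac):
    """Hypothetical helper: project var_g onto tx_ac, or return var_g unchanged.

    If the genomic position is beyond the bounds of the transcript
    (for example, in a promoter), no transcript-level description is
    generated and the g. variant is returned as-is.
    """
    try:
        return mapper.g_to_t(var_g, tx_ac)
    except HGVSInvalidIntervalError:
        # Flanking sequence: keep the genomic description.
        return var_g
```

In hgvs-shell, calling `g_to_t_or_keep_g(am37, g, 'NM_198253.2')` with the TERT variant above would be expected to hand back `g` itself, since `g_to_t` raises for that position.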

markgene commented 4 months ago

I see your point and agree with you. Thanks!