biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0
233 stars 94 forks source link

Consider exposing a utility for converting between g. variants by leveraging pyliftover #711

Open mihaitodor opened 7 months ago

mihaitodor commented 7 months ago

Workaround using the existing functionality: If I have, say var_g = hgvs.parser.Parser().parse_hgvs_variant('NC_000005.9:g.112090651A>G') and I want to do liftover between assemblies, one way that seems to work is to call vm.relevant_transcripts(var_g) and pick a random relevant transcript (that starts with NM_) using the GRCh38 AssemblyMapper, then do var_c = vm.g_to_c(parsed_variant, relevant_transcript) and then use the AssemblyMapper associated with GRCh37 to call c_to_g(var_c). However, it feels clunky and I'm not sure if picking a random "relevant transcript" makes sense in the general case.

Thank you @rhdolin for figuring out this workaround!

korikuzma commented 7 months ago

Hi @mihaitodor . We do something similar in cool-seq-tool. Here is what we use to select a preferred transcript. cool-seq-tool has methods already for doing this liftover.

At some point (when it's a little bit more cleaned up), I would like to propose for cool-seq-tool to move into the biocommons workspace.

mihaitodor commented 7 months ago

Hi @korikuzma 👋 Thank you for letting me know! I guess I'll need more background to grasp the reasoning behind "Representative transcript priority" in that document, but it's probably best for me to use this library when it's available instead of trying to replicate it in my code. Do let me know if you need any help.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 4 months ago

This issue was closed because it has been stalled for 7 days with no activity.

mihaitodor commented 3 months ago

I think we bumped into an edge case where this approach doesn't work. One example is NC_000010.10:g.12307894C>T, for which relevant_transcripts() returns an empty list.

Kallepan commented 3 months ago

I have had problems with this approach as well. Usually when the variant is outside of the coordinates covered by the transcript. Usually denoted by a +- in the cDNA change (e.g: c.308+1G>T)