biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0
240 stars 94 forks source link

mitochondria (m.) assembly mapping #663

Open davmlaw opened 1 year ago

davmlaw commented 1 year ago

RefSeq does not have MT transcripts, but Ensembl does, eg ENST00000361381.2

There is currently no "m" (ie m_to_c) etc methods in AssemblyMapper, so you can't convert these easily.

Q. Should we add m_to_x and x_to_m methods in AssemblyMapper?

var_m = parser.parse("NC_012920.1:m.11038delA")
am.g_to_c(var_m, "ENST00000361381.2")
File /usr/local/lib/python3.10/dist-packages/hgvs/variantmapper.py:235, in VariantMapper.g_to_c(self, var_g, tx_ac, alt_aln_method)
    222 """Given a parsed g. variant, return a c. variant on the specified
    223 transcript using the specified alignment method (default is
    224 "splign" from NCBI).
   (...)
    231 
    232 """
    234 if not (var_g.type == "g"):
--> 235     raise HGVSInvalidVariantError("Expected a g. variant; got " + str(var_g))
    236 if self._validator:
    237     self._validator.validate(var_g)

HGVSInvalidVariantError: Expected a g. variant; got NC_012920.1:m.11038del

It's pretty trivial to get it to work, by switching the 'm' to a 'g' (this is probably how you'd implement it)

var_g = deepcopy(var_m)
var_g.type = 'g'
In [19]: print(am.g_to_c(var_g, "ENST00000361381.2"))
ENST00000361381.2:c.279del

I don't think you can reproduce this using UTA as it doesn't have Ensembl transcripts, but here's the cdot code:

import hgvs.assemblymapper

from copy import deepcopy
from hgvs.parser import Parser
from cdot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider, FastaSeqFetcher

seqfetcher = FastaSeqFetcher("/data/annotation/fasta/GCF_000001405.39_GRCh38.p13_genomic.fna.gz")
hdp = JSONDataProvider(["/data/annotation/cdot/ensembl/GRCh38/cdot-0.2.17.ensembl.grch38.json.gz"], seqfetcher=seqfetcher)
am = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name='GRCh38', alt_aln_method='splign', replace_reference=True)

parser = Parser()

var_m = parser.parse("NC_012920.1:m.11038delA")
am.g_to_c(var_m, "ENST00000361381.2")
AngieHinrichs commented 1 year ago

There are a few codons that are translated differently by mitochondria: https://en.wikipedia.org/wiki/Human_mitochondrial_genetics#Genetic_code_variants so the wrapper would need a bit more code to deal with those.

andreasprlic commented 1 year ago

HGVS uses the translate_cds method from bioutils. That one already supports alternative translation tables (eg. for selenoproteins). We need another translation table there for mitochondria. That would be similar to this. Then this needs to get enabled with the AltTranscriptData / AltSeqBuilder somehow.

About "Refseq does not have MT transcripts": Would this data be sufficient for us to offer "m_to_p"?

andreasprlic commented 1 year ago

A comment on the initial request: the transcript model for mitochondria is prob a slightly different for mitochondria. I am not sure if we would express mitochondrial variants using c. nomenclature. I think people would refer to them as m. or p. variants. See some test variants that have been provided as part of https://github.com/biocommons/hackathon-2023/issues/4 . We won't get to work on the mito ticket as part of the hackathon, but will follow up on this afterwards.

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 10 months ago

This issue was closed because it has been stalled for 7 days with no activity.

reece commented 7 months ago

This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.