Closed alex97751 closed 3 years ago
There are several reasons this doesn't work.
The first is that the ensembl data in UTA are very stale. That transcript isn't there. The reasons for this are in https://github.com/biocommons/uta/issues/233.
You capture an exception but just return False. When you do that, you're burying the message that likely would have explained what's happening. At the very least, print or log the exception before ignoring it.
You may want to see this old talk about the origins of UTA and some of the dark corners of the false equivalence of some transcripts.
Hi, as well as the transcript, the sequence is not in SeqRepo:
HGVSDataNotAvailableError: Failed to fetch ENST00000372348.2 from bioutils.seqfetcher (network fetching) ('Ensembl API only provides ENST00000372348 version (9), requested: 2')
To resolve it, you can use the cdot library/data to provide the transcript mapping coordinates and sequence
Install via python3 -m pip install cdot
import hgvs.parser
import hgvs.validator
from cdot.hgvs.dataproviders import RESTDataProvider, FastaSeqFetcher
# see how to get fasta: https://github.com/SACGF/cdot/wiki/FastaSeqFetcher
seqfetcher = FastaSeqFetcher("/data/annotation/fasta/GCF_000001405.25_GRCh37.p13_genomic.fna.gz")
hdp = RESTDataProvider(seqfetcher=seqfetcher)
hp = hgvs.parser.Parser()
var_c = hp.parse_hgvs_variant("ENST00000372348.2:c.1001C>T")
vr = hgvs.validator.Validator(hdp)
valid = vr.validate(var_c)
print(f"Valid: {valid}")
from hgvs.assemblymapper import AssemblyMapper
am = AssemblyMapper(hdp,
assembly_name='GRCh37',
alt_aln_method='splign', replace_reference=True)
am.c_to_g(var_c)
print(f"Resolves to {am.c_to_g(var_c)}")
Output:
Valid: True
Resolves to NC_000009.11:g.133748283C>T
How come parsing a variant in HGVS notation with an ENSEMBL Identifier (EBI) does not work when the HGVS notation explicitly states that they can be used as a reference sequence in HGVS?
The HGVS documentation even recommends the use of Ensembl transcripts.
HGVS Reference Sequences
It first has to be converted from ENSEMBL (ENST..) to a RefSeq ID (NM.. ) in order to work. Even tools such as Mutalyzer or VariantValidator are unable to validate or even parse them.
Just a short code snippet that explains what i mean. First we run the code on a variant with a RefSeq ID (NM_007313.2:c.1001C>T) and then we run the same code with the corresponding ENS ID (ENST00000372348.2:c.1001C>T).
The mapping should be correct as it has been done by using a local copy of the ENSEMBLEDB (ensembldb.ensembl.org). You can also use their RESP API to check for yourself: rest.ensembl.org
What am i missing here? Maybe someone can help me out :)