HGVS using ENSEMBL ID is not working

alex97751 commented 3 years ago

How come parsing a variant in HGVS notation with an ENSEMBL Identifier (EBI) does not work when the HGVS notation explicitly states that they can be used as a reference sequence in HGVS?

"reference sequences must come from data sources that provide stable and permanent identifiers, e.g. RefSeq (NCBI) and Ensembl (EBI)"

The HGVS documentation even recommends the use of Ensembl transcripts.

Recommended Reference Sequences types are: -RefSeq sequences with the prefixes NC, NT, NW,NG, NM, NR or NP -LRG sequences with the prefixes LRG#, LRG#t#, LRG#p#, -Ensembl transcript (ENST) and protein (ENSP) which are not identified by Ensembl as being incomplete

HGVS Reference Sequences

It first has to be converted from ENSEMBL (ENST..) to a RefSeq ID (NM.. ) in order to work. Even tools such as Mutalyzer or VariantValidator are unable to validate or even parse them.

import hgvs.parser
import hgvs.validator
import hgvs.dataproviders.uta

def VariantHGVS(variant_HGVS):
    v = parseVariant(variant_HGVS)
    print("Is valid: " + str(validate(v)))
    return

def parseVariant(variant):
    hp = hgvs.parser.Parser()
    var_g = hp.parse_hgvs_variant(variant)

    return var_g

def validate(var_g):
    hdp = hgvs.dataproviders.uta.connect()
    vr = hgvs.validator.Validator(hdp)
    try:
        return vr.validate(var_g)
    except hgvs.exceptions.HGVSError as e:
        return False

VariantHGVS("NM_007313.2:c.1001C>T")  # TRUE
VariantHGVS("ENST00000372348.2:c.1001C>T")  # FALSE

Just a short code snippet that explains what i mean. First we run the code on a variant with a RefSeq ID (NM_007313.2:c.1001C>T) and then we run the same code with the corresponding ENS ID (ENST00000372348.2:c.1001C>T).

The mapping should be correct as it has been done by using a local copy of the ENSEMBLEDB (ensembldb.ensembl.org). You can also use their RESP API to check for yourself: rest.ensembl.org

What am i missing here? Maybe someone can help me out :)

reece commented 3 years ago

There are several reasons this doesn't work.

The first is that the ensembl data in UTA are very stale. That transcript isn't there. The reasons for this are in https://github.com/biocommons/uta/issues/233.

You capture an exception but just return False. When you do that, you're burying the message that likely would have explained what's happening. At the very least, print or log the exception before ignoring it.

You may want to see this old talk about the origins of UTA and some of the dark corners of the false equivalence of some transcripts.

davmlaw commented 1 year ago

Hi, as well as the transcript, the sequence is not in SeqRepo:

HGVSDataNotAvailableError: Failed to fetch ENST00000372348.2 from bioutils.seqfetcher (network fetching) ('Ensembl API only provides ENST00000372348 version (9), requested: 2')

To resolve it, you can use the cdot library/data to provide the transcript mapping coordinates and sequence

Install via python3 -m pip install cdot

import hgvs.parser
import hgvs.validator

from cdot.hgvs.dataproviders import RESTDataProvider, FastaSeqFetcher

# see how to get fasta: https://github.com/SACGF/cdot/wiki/FastaSeqFetcher
seqfetcher = FastaSeqFetcher("/data/annotation/fasta/GCF_000001405.25_GRCh37.p13_genomic.fna.gz")
hdp = RESTDataProvider(seqfetcher=seqfetcher)
hp = hgvs.parser.Parser()
var_c = hp.parse_hgvs_variant("ENST00000372348.2:c.1001C>T")
vr = hgvs.validator.Validator(hdp)
valid = vr.validate(var_c)
print(f"Valid: {valid}")

from hgvs.assemblymapper import AssemblyMapper
am = AssemblyMapper(hdp,
                    assembly_name='GRCh37',
                    alt_aln_method='splign', replace_reference=True)
am.c_to_g(var_c)
print(f"Resolves to {am.c_to_g(var_c)}")

Output:

Valid: True
Resolves to NC_000009.11:g.133748283C>T

biocommons / hgvs

HGVS using ENSEMBL ID is not working #621