hgvs c_to_p failing with Ensembl transcript

chubukov commented 3 weeks ago

I'm trying to follow the example on the front page but with an Ensembl transcript ID instead of RefSeq.

I get the same error whether I use the REST API or the static JSONs

import hgvs
from hgvs.assemblymapper import AssemblyMapper
from cdot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider

#hdp = RESTDataProvider()  # Uses API server at cdot.cc
hdp = JSONDataProvider(["cdot-0.2.26.ensembl.grch37.json.gz","cdot-0.2.26.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz"])  # Uses local JSON file

am = AssemblyMapper(hdp,
                    assembly_name='GRCh37',
                    alt_aln_method='splign', replace_reference=True)

hp = hgvs.parser.Parser()
#var_c = hp.parse_hgvs_variant('NM_001637.3:c.1582G>A')
var_c = hp.parse_hgvs_variant("ENST00000256078.4:c.34G>T")
am.c_to_g(var_c)

HGVSDataNotAvailableError: Failed to fetch ENST00000256078.4 from bioutils.seqfetcher (network fetching) ('Ensembl API only provides ENST00000256078 version (10), requested: 4')

Despite the fact that

$ zgrep -c ENST00000256078.4 cdot*
cdot-0.2.26.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz:1
cdot-0.2.26.ensembl.grch37.json.gz:1

Am I using the tool correctly?

davmlaw commented 2 weeks ago

To resolve a HGVS you need the transcript alignment information as well as the transcript sequence

You have the transcript ok, your error above is getting the sequence from seqrepo.

I have some classes to build the transcript sequence from the genome build sequence:

https://github.com/SACGF/cdot/wiki/FastaSeqFetcher

This is fine with Ensembl as the transcripts match the genome reference

I hope you can work it out via the wiki link above, I think I will make the chained fetcher (try transcript, fall back on genome) the default in examples. Maybe adding a warning for refseq or if there are alignment mismatches etc

chubukov commented 2 weeks ago

Thanks Dave. Is there a simpler path if I only want the protein change and don't care about the genome position? (c_to_p currently gives the same error as above). I realize that's probably a question about the hgvs package, not cdot.

davmlaw commented 2 weeks ago

Hi, I don't use the c_to_p functionality in HGVS (I use Ensembl VEP), so that's not implemented in cdot. I've raised issue #83 to do so

I've made a note there to come back here and notify you when it's ready.

As a work around, possibly you could try

Convert to genomic coords / VCF then use an annotator like Ensembl VEP
potentially try changing transcript versions to one that has a protein associated with it in non-cdot HGVS library? If you can convert it to g.HGVS in cdot, then convert it back to a different c. transcript version using UTA, and the coding position and change is the same as what you started with, then it should be safe?

davmlaw commented 2 weeks ago

Hi, @chubukov I've made a new data release which hopefully fixes your issue (you don't need to upgrade your code version just the data) see:

https://github.com/SACGF/cdot/releases/tag/data_v0.2.27

import hgvs
from hgvs.assemblymapper import AssemblyMapper
from cdot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider, FastaSeqFetcher

cdot_filename = "/data/annotation/cdot_generation/ensembl/GRCh37/cdot-0.2.27.ensembl.grch37.json.gz"
fasta_filename = "/data/annotation/fasta/GCF_000001405.25_GRCh37.p13_genomic.fna.gz"

seqfetcher = FastaSeqFetcher(fasta_filename)
hdp = JSONDataProvider([filename], seqfetcher=seqfetcher)

am = AssemblyMapper(hdp,
                    assembly_name='GRCh37',
                    alt_aln_method='splign', replace_reference=True)

hp = hgvs.parser.Parser()
var_c = hp.parse_hgvs_variant("ENST00000256078.4:c.34G>T")

So this now works

In [17]: am.c_to_g(var_c)
Out[17]: SequenceVariant(ac=NC_000012.11, type=g, posedit=25398285C>A, gene=None)

In [18]: am.c_to_p(var_c)
Out[18]: SequenceVariant(ac=ENSP00000256078.4, type=p, posedit=(Gly12Cys), gene=None)

Feel free to re-open this issue if you have any issues.

davmlaw commented 2 weeks ago

@chubukov also, in your example code:

hdp = JSONDataProvider(["cdot-0.2.26.ensembl.grch37.json.gz","cdot-0.2.26.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz"])  # Uses local JSON file

You don't need to include the second JSON.gz file as the 1st one includes all of the GRCh37 transcripts (including the ones in 87). Won't hurt, but will just take a bit longer to load

chubukov commented 2 weeks ago

Thanks very much @davmlaw. Will try this again next week.

chubukov commented 1 week ago

This works, thanks!

Two small notes hdp = JSONDataProvider([filename], seqfetcher=seqfetcher) should be hdp = JSONDataProvider([cdot_filename], seqfetcher=seqfetcher)

I downloaded GCF_000001405.25_GRCh37.p13_genomic.fna.gz from NCBI but then had to gunzip it and bgzip it

SACGF / cdot

hgvs c_to_p failing with Ensembl transcript #82