Closed chubukov closed 2 weeks ago
To resolve a HGVS you need the transcript alignment information as well as the transcript sequence
You have the transcript ok, your error above is getting the sequence from seqrepo.
I have some classes to build the transcript sequence from the genome build sequence:
https://github.com/SACGF/cdot/wiki/FastaSeqFetcher
This is fine with Ensembl as the transcripts match the genome reference
I hope you can work it out via the wiki link above, I think I will make the chained fetcher (try transcript, fall back on genome) the default in examples. Maybe adding a warning for refseq or if there are alignment mismatches etc
Thanks Dave. Is there a simpler path if I only want the protein change and don't care about the genome position? (c_to_p
currently gives the same error as above). I realize that's probably a question about the hgvs package, not cdot.
Hi, I don't use the c_to_p functionality in HGVS (I use Ensembl VEP), so that's not implemented in cdot. I've raised issue #83 to do so
I've made a note there to come back here and notify you when it's ready.
As a work around, possibly you could try
Hi, @chubukov I've made a new data release which hopefully fixes your issue (you don't need to upgrade your code version just the data) see:
https://github.com/SACGF/cdot/releases/tag/data_v0.2.27
import hgvs
from hgvs.assemblymapper import AssemblyMapper
from cdot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider, FastaSeqFetcher
cdot_filename = "/data/annotation/cdot_generation/ensembl/GRCh37/cdot-0.2.27.ensembl.grch37.json.gz"
fasta_filename = "/data/annotation/fasta/GCF_000001405.25_GRCh37.p13_genomic.fna.gz"
seqfetcher = FastaSeqFetcher(fasta_filename)
hdp = JSONDataProvider([filename], seqfetcher=seqfetcher)
am = AssemblyMapper(hdp,
assembly_name='GRCh37',
alt_aln_method='splign', replace_reference=True)
hp = hgvs.parser.Parser()
var_c = hp.parse_hgvs_variant("ENST00000256078.4:c.34G>T")
So this now works
In [17]: am.c_to_g(var_c)
Out[17]: SequenceVariant(ac=NC_000012.11, type=g, posedit=25398285C>A, gene=None)
In [18]: am.c_to_p(var_c)
Out[18]: SequenceVariant(ac=ENSP00000256078.4, type=p, posedit=(Gly12Cys), gene=None)
Feel free to re-open this issue if you have any issues.
@chubukov also, in your example code:
hdp = JSONDataProvider(["cdot-0.2.26.ensembl.grch37.json.gz","cdot-0.2.26.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz"]) # Uses local JSON file
You don't need to include the second JSON.gz file as the 1st one includes all of the GRCh37 transcripts (including the ones in 87). Won't hurt, but will just take a bit longer to load
Thanks very much @davmlaw. Will try this again next week.
This works, thanks!
Two small notes
hdp = JSONDataProvider([filename], seqfetcher=seqfetcher)
should be hdp = JSONDataProvider([cdot_filename], seqfetcher=seqfetcher)
I downloaded GCF_000001405.25_GRCh37.p13_genomic.fna.gz from NCBI but then had to gunzip
it and bgzip
it
I'm trying to follow the example on the front page but with an Ensembl transcript ID instead of RefSeq.
I get the same error whether I use the REST API or the static JSONs
Despite the fact that
Am I using the tool correctly?