Closed IsmailM closed 7 years ago
Hi @IsmailM
We changed the way we handle metadata in the cache late in 2014 and this is now required when checking what data is available. Release 75 was before this change, so the metadata is not available. Do you need to use cache version 75? GRCh37 is supported in later releases.
Many thanks for clarifying that. That's useful to know.
I also fired off an email to CADD support - they said:
CADD needs to be run with the "ancient" version of VEP and the corresponding cache. It has been trained on this exact version of annotations and scores. Results will not be valid otherwise.
Thus I need to install and run ensembl-tools-release-76 just for CADD :(
The BioPerl URL is incorrect in the INSTALL.pl
in that branch - see here. It should be something like:
$BIOPERL_URL ||= 'https://cpan.metacpan.org/authors/id/C/CJ/CJFIELDS/BioPerl-1.6.1.tar.gz';
Any chance you could update that so that others don't have the same difficulty.
Many Thanks Ismail M
Hello, I am getting the same error with API verison 94 and data version 92. I am on Mac OSX 10.11.6 el Capital, and VEP was installed by this doc: https://gist.github.com/ckandoth/5390e3ae4ecf182fa92f6318cfa9fa97
$ ./vep --species homo_sapiens --assembly GRCh37 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/$VER\_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file examples/homo_sapiens_GRCh37.vcf --output_file examples/homo_sapiens_GRCh37.vep.vcf --polyphen b --af --af_1kg --af_esp --regulatory
-------------------- EXCEPTION --------------------
MSG: ERROR: SIFT not available
STACK Bio::EnsEMBL::VEP::AnnotationSource::Cache::Transcript::check_sift_polyphen /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/Cache/Transcript.pm:168
STACK Bio::EnsEMBL::VEP::AnnotationSource::Cache::Transcript::new /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/Cache/Transcript.pm:121
STACK Bio::EnsEMBL::VEP::CacheDir::get_all_AnnotationSources /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:142
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all_from_cache /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:121
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:93
STACK Bio::EnsEMBL::VEP::BaseRunner::get_all_AnnotationSources /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:175
STACK Bio::EnsEMBL::VEP::Runner::init /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:123
STACK Bio::EnsEMBL::VEP::Runner::run /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel ./vep:224
Date (localtime) = Wed Nov 7 09:46:49 2018
Ensembl API version = 94
---------------------------------------------------
Hi, vep checks if sift is available by reading the info.txt file in the cache directory. Could you please send me the content of that file? The file is under path_to_your_cache_directory/homo_sapiens/92_GRCh37/info.txt Thanks, Anja
$ cat .vep/homo_sapiens/92_GRCh37/info.txt
species homo_sapiens
assembly GRCh37
sift b
polyphen b
source_polyphen 2.2.2
source_sift sift5.2.2
source_genebuild 2011-04
source_gencode GENCODE 19
source_assembly GRCh37.p13
variation_cols variation_name,failed,somatic,start,end,allele_string,strand,minor_allele,minor_allele_freq,clin_sig,phenotype_or_disease,pubmed,AFR,AMR,EAS,EUR,SAS,AA,EA,gnomAD,gnomAD_AFR,gnomAD_AMR,gnomAD_ASJ,gnomAD_EAS,gnomAD_FIN,gnomAD_NFE,gnomAD_OTH,gnomAD_SAS
source_COSMIC 81
source_HGMD-PUBLIC 20164
source_ESP 20141103
source_ClinVar 201706
source_dbSNP 150
source_1000genomes phase3
source_gnomAD 170228
regulatory 1
cell_types A549,Aorta,B_cells_(PB)_Roadmap,CD14+CD16-_monocyte_(CB),CD14+CD16-_monocyte_(VB),CD4+_ab_T_cell_(VB),CD8+_ab_T_cell_(CB),CM_CD4+_ab_T_cell_(VB),DND-41,EPC_(VB),Fetal_Adrenal_Gland,Fetal_Intestine_Large,Fetal_Intestine_Small,Fetal_Muscle_Leg,Fetal_Muscle_Trunk,Fetal_Stomach,Fetal_Thymus,GM12878,Gastric,H1-mesenchymal,H1-neuronal_progenitor,H1-trophoblast,H1ESC,H9,HMEC,HSMM,HSMMtube,HUVEC,HUVEC_prol_(CB),HeLa-S3,HepG2,IMR90,K562,Left_Ventricle,Lung,M0_macrophage_(CB),M0_macrophage_(VB),M1_macrophage_(CB),M1_macrophage_(VB),M2_macrophage_(CB),M2_macrophage_(VB),MSC_(VB),Monocytes-CD14+,Monocytes-CD14+_(PB)_Roadmap,NH-A,NHDF-AD,NHEK,NHLF,Natural_Killer_cells_(PB),Osteobl,Ovary,Pancreas,Placenta,Psoas_Muscle,Right_Atrium,Small_Intestine,Spleen,T_cells_(PB)_Roadmap,Thymus,eosinophil_(VB),erythroblast_(CB),iPS-20b,iPS_DF_19.11,iPS_DF_6.9,naive_B_cell_(VB),neutrophil_(CB),neutrophil_(VB),neutrophil_myelocyte_(BM)
source_regbuild 1.0
OK. That looks good. Is there a reason why you are running API version with cache version 92? We do recommend to use the same cache version as API version. Can you please try running with --cache_version 92?
No particular reason, just trying not to deviate from instructions in https://gist.github.com/ckandoth/5390e3ae4ecf182fa92f6318cfa9fa97 which uses VER=92
In general, I don't really need VEP, I am installing it as a pre-req for vcf2maf which I am after ( https://github.com/mskcc/vcf2maf/blob/master/README.md ).
So to get datasets of version 94 I would need set export VER=94
, then download all the files with 94 in the name instead of 92?
Sorry, but we need to take a step back. If you followed the installation instructions why are using the API version 94? The error you get indicates that VEP can read the info file in the cache directory. However this also means that you have cache files for version 94 installed? Otherwise I would expect an error like MSG: ERROR: No cache found for homo_sapiens, version 94. So I would recommend that you check your versions of API and cache files and use the same version for both for your analysis.
I am not sure if I am using API version 94, I think that because of the error message I am getting Ensembl API version = 94
(see above where I started) . Is there a command to get the version? -v or --version do not give me that.
I see. Can you please go into your ensembl-vep directory and type perl INSTALL -help. This should print the version at the top of the output. Thanks
Here it is:
$ perl INSTALL.pl -help
#---------------#
# VEP INSTALLER #
#---------------#
versions
ensembl : 94.5c08d90
ensembl-funcgen : 94.08b0c13
ensembl-io : 94.8d53275
ensembl-variation : 94.066b102
ensembl-vep : 94.5
BTW, I am downloading datasets v.94 right now, but that will take couple of hours...
It is good that you have version 94 installed. I'm just wondering how you got there if you followed the instructions here: https://gist.github.com/ckandoth/5390e3ae4ecf182fa92f6318cfa9fa97 Is it also possible that you have 94 cache files? Do you have any directories called 94_GRCh37? That should be under path_to_your_cache_directory/homo_sapiens/. If not I would suggest that you manually download ftp://ftp.ensembl.org/pub/release-94/variation/VEP/homo_sapiens_vep_94_GRCh37.tar.gz and run tar xzf homo_sapiens_vep_94_GRCh37.tar.gz After those steps test your script again. If the error disappears you can run the convert_cache script as described in your instructions. Let me know how it goes.
Did by any chance running your command with --cache_version 92 work?
adding --cache_version 92 does not change anything
Now I downloaded 94 cache files and ran again the following:
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-$VER/variation/VEP/homo_sapiens_vep_$VER\_GRCh37.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-$VER/variation/VEP/homo_sapiens_vep_$VER\_GRCh38.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-$VER/variation/VEP/mus_musculus_vep_$VER\_GRCm38.tar.gz $VEP_DATA
cat $VEP_DATA/*_vep_$VER\_GRC{h37,h38,m38}.tar.gz | tar -izxf - -C $VEP_DATA
perl INSTALL.pl --AUTO a --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA --NO_HTSLIB
perl INSTALL.pl --AUTO f --SPECIES homo_sapiens --ASSEMBLY GRCh37 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO f --SPECIES homo_sapiens --ASSEMBLY GRCh38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO f --SPECIES mus_musculus --ASSEMBLY GRCm38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
The error changed:
$ ./vep --species mus_musculus --assembly GRCm38 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/mus_musculus/$VER\_GRCm38/Mus_musculus.GRCm38.dna.toplevel.fa.gz --input_file examples/mus_musculus_GRCm38.vcf --output_file examples/mus_musculus_GRCm38.vep.vcf --regulatory
-------------------- EXCEPTION --------------------
MSG: ERROR: Cannot index bgzipped FASTA file with Bio::DB::Fasta
STACK Bio::EnsEMBL::Variation::Utils::FastaSequence::setup_fasta /Users/ipozdnya/ensembl-vep/Bio/EnsEMBL/Variation/Utils/FastaSequence.pm:210
STACK Bio::EnsEMBL::VEP::BaseVEP::fasta_db /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseVEP.pm:475
STACK Bio::EnsEMBL::VEP::Runner::init /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:129
STACK Bio::EnsEMBL::VEP::Runner::run /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel ./vep:224
Date (localtime) = Thu Nov 8 19:09:09 2018
Ensembl API version = 94
Can you please unzip your FASTA file and just gzip it? I noticed that you switched species to --species mus_musculus. Did you also install the cache files for mouse?
cannot index bgzipped FASTA file with Bio::DB::Fasta
./vep --species homo_sapiens --assembly GRCh37 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/$VER\_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file examples/homo_sapiens_GRCh37.vcf --output_file examples/homo_sapiens_GRCh37.vep.vcf --polyphen b --af --af_1kg --af_esp --regulatory
cannot index bgzipped FASTA file with Bio::DB::Fasta
error: [(https://github.com/mskcc/vcf2maf/issues/97)] . It suggests that a newer version of perl is required but that would be a dead end for me as I can not control perl version in all the environments involved. It sounds like your FASTA file is bgzipped: Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz I suggest that you unzip it and then just use gzip for compressing the FASTA file again.
bzip error seems to be a manifestation of some other problem, not use of bzip2. Might be that perl version thing after all First, the extension is .gz, not .bzip. Second, upzipping and zipping again did not change anything - same bzip complain Third, checkng the file tells me it's a regular gzipped:
$ file Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz: gzip compressed data, was "Homo_sapiens.GRCh37.75.dna.prim", from Unix, last modified: Wed Nov 7 17:08:27 2018
When I try to run the vf2maf test (which is the goal of my whole activity) with --ref-fasta specified in the command line, there is another error:
$ perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf --ref-fasta Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
`[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip`
[faidx] Could not load fai index of /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 1:11290178 from reference FASTA: /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
..several more similar warnings...
WARNING: Couldn't retrieve bps around 17:7579312 from reference FASTA: /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
STATUS: Running VEP and writing to: tests/test.vep.vcf
ERROR: Cannot find VEP script in path: /Users/ipozdnya/vep
Is it possible that my version of vcf2maf wants fasta to be bzipped while VEP wants it gzipped?
You can use bgzipped FASTA files with VEP if you have Bio::DB::HTS::Faidx installed. The code and installation instructions are here: https://github.com/Ensembl/Bio-DB-HTS/tree/release/v2.11
I figure out the error ERROR: Cannot find VEP script in path: /Users/ipozdnya/vep
from vcf2maf test run from the previous post: vcf2maf.pl expects vep executable to be in vep
subdir
my ( $vep_path, $vep_data, $vep_forks, $buffer_size, $any_allele ) = ( "$ENV{HOME}/ensembl-vep", "$ENV{HOME}/.vep", 4, 5000, 0 );
while on my machine vep gets installed in subdir ensemble-vep. After correcting the script to point to ensembl-vep
, I started to get the same error about bzip from vcf2maf.pl run as well:
$ perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf --ref-fasta /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip
[faidx] Could not load fai index of /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 1:11290178 from reference FASTA: /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
..bunch of similar warnings...
WARNING: Couldn't retrieve bps around 17:7579312 from reference FASTA: /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
STATUS: Running VEP and writing to: tests/test.vep.vcf
-------------------- EXCEPTION --------------------
**MSG: ERROR: Cannot index bgzipped FASTA file with Bio::DB::Fasta**
STACK Bio::EnsEMBL::Variation::Utils::FastaSequence::setup_fasta /Users/ipozdnya/ensembl-vep/Bio/EnsEMBL/Variation/Utils/FastaSequence.pm:210
STACK Bio::EnsEMBL::VEP::BaseVEP::fasta_db /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseVEP.pm:475
STACK Bio::EnsEMBL::VEP::Runner::init /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:129
STACK Bio::EnsEMBL::VEP::Runner::run /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel /Users/ipozdnya/ensembl-vep/vep:224
Date (localtime) = Fri Nov 9 12:46:26 2018
Ensembl API version = 94
---------------------------------------------------
ERROR: Failed to run the VEP annotator!
I'm currently running CADD locally which runs VEP internally:
Here is the section of the commands that CADD runs:
The above works perfectly with ensembl-tools-release-87 release from the ensembl vep website.
However, with the github release/87 branch, the VEP line fails with the following:
Is this because of changes to ensembl-vep or simply because CADD links to an older annotation db?