Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
444 stars 151 forks source link

SIFT not available #33

Closed IsmailM closed 7 years ago

IsmailM commented 7 years ago

I'm currently running CADD locally which runs VEP internally:

Here is the section of the commands that CADD runs:

VEPpath=/home/ucbtmog/data/.progz/ensembl-vep/vep.pl
CADDpath=/home/ucbtmog/data/analysis/CADD
INPUT=test.vcf.gz
OUTPUT=out.tsv.gz

zcat ${INPUT} | \
${CADDpath}/bin/src/VCF2vepVCF.py |  \
sort -k1,1 -k2,2n -k3,3 -k4,4 |  \
uniq | \
${CADDpath}/bin/src/extract_scored.py -p ${CADDpath}/prescored/whole_genome_SNVs.tsv.gz --found_out=>( gzip -c > ${OUTPUT}_p2.tmp ) | \
${CADDpath}/bin/src/extract_scored.py -p ${CADDpath}/prescored/InDels.tsv.gz --found_out=>( gzip -c > ${OUTPUT}_p3.tmp ) | \
perl ${VEPpath} --quiet --cache --offline --dir ${CADDpath}/annotations/vep/ --buffer 1000 --no_stats --species homo_sapiens --db_version=75 --format vcf --regulatory --sift b --polyphen b --per_gene --ccds --domains --numbers --canonical --total_length --output_file >( awk 'BEGIN{ FS="\t"; OFS="\t"; }{ if ($1 ~ /^#/) { if ($1 ~ /^#Up/) { sub("#","",$1); print "#Chrom","Start","End",$0 } else { print } } else { split($2,a,":"); split(a[2],b,"-"); if (length(b) == 2) { print a[1],b[1],b[2],$0 } else { print a[1],b[1],b[1],$0 } }}' ) --force_overwrite 

The above works perfectly with ensembl-tools-release-87 release from the ensembl vep website.

However, with the github release/87 branch, the VEP line fails with the following:

-------------------- EXCEPTION --------------------
MSG: ERROR: SIFT not available

STACK Bio::EnsEMBL::VEP::AnnotationSource::Cache::Transcript::check_sift_polyphen /mnt/254b78b9-76b4-422d-84b1-cc632bff60f7/IsmailM/.progz/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/Cache/Transcript.pm:96
STACK Bio::EnsEMBL::VEP::AnnotationSource::Cache::Transcript::new /mnt/254b78b9-76b4-422d-84b1-cc632bff60f7/IsmailM/.progz/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/Cache/Transcript.pm:70
STACK Bio::EnsEMBL::VEP::CacheDir::get_all_AnnotationSources /mnt/254b78b9-76b4-422d-84b1-cc632bff60f7/IsmailM/.progz/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:95
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all_from_cache /mnt/254b78b9-76b4-422d-84b1-cc632bff60f7/IsmailM/.progz/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:81
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all /mnt/254b78b9-76b4-422d-84b1-cc632bff60f7/IsmailM/.progz/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:63
STACK Bio::EnsEMBL::VEP::BaseRunner::get_all_AnnotationSources /mnt/254b78b9-76b4-422d-84b1-cc632bff60f7/IsmailM/.progz/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:119
STACK Bio::EnsEMBL::VEP::Runner::init /mnt/254b78b9-76b4-422d-84b1-cc632bff60f7/IsmailM/.progz/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:77
STACK Bio::EnsEMBL::VEP::Runner::run /mnt/254b78b9-76b4-422d-84b1-cc632bff60f7/IsmailM/.progz/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:111
STACK toplevel /mnt/254b78b9-76b4-422d-84b1-cc632bff60f7/IsmailM/.progz/ensembl-vep/vep.pl:193
Date (localtime)    = Tue Mar 28 00:27:08 2017
Ensembl API version = 87
---------------------------------------------------

Is this because of changes to ensembl-vep or simply because CADD links to an older annotation db?

sarahhunt commented 7 years ago

Hi @IsmailM

We changed the way we handle metadata in the cache late in 2014 and this is now required when checking what data is available. Release 75 was before this change, so the metadata is not available. Do you need to use cache version 75? GRCh37 is supported in later releases.

IsmailM commented 7 years ago

Many thanks for clarifying that. That's useful to know.

I also fired off an email to CADD support - they said:

CADD needs to be run with the "ancient" version of VEP and the corresponding cache. It has been trained on this exact version of annotations and scores. Results will not be valid otherwise.

Thus I need to install and run ensembl-tools-release-76 just for CADD :(

The BioPerl URL is incorrect in the INSTALL.pl in that branch - see here. It should be something like:

$BIOPERL_URL  ||= 'https://cpan.metacpan.org/authors/id/C/CJ/CJFIELDS/BioPerl-1.6.1.tar.gz';

Any chance you could update that so that others don't have the same difficulty.

Many Thanks Ismail M

cegg commented 5 years ago

Hello, I am getting the same error with API verison 94 and data version 92. I am on Mac OSX 10.11.6 el Capital, and VEP was installed by this doc: https://gist.github.com/ckandoth/5390e3ae4ecf182fa92f6318cfa9fa97

$ ./vep --species homo_sapiens --assembly GRCh37 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/$VER\_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file examples/homo_sapiens_GRCh37.vcf --output_file examples/homo_sapiens_GRCh37.vep.vcf --polyphen b --af --af_1kg --af_esp --regulatory

-------------------- EXCEPTION --------------------
MSG: ERROR: SIFT not available

STACK Bio::EnsEMBL::VEP::AnnotationSource::Cache::Transcript::check_sift_polyphen /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/Cache/Transcript.pm:168
STACK Bio::EnsEMBL::VEP::AnnotationSource::Cache::Transcript::new /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/Cache/Transcript.pm:121
STACK Bio::EnsEMBL::VEP::CacheDir::get_all_AnnotationSources /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:142
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all_from_cache /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:121
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:93
STACK Bio::EnsEMBL::VEP::BaseRunner::get_all_AnnotationSources /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:175
STACK Bio::EnsEMBL::VEP::Runner::init /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:123
STACK Bio::EnsEMBL::VEP::Runner::run /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel ./vep:224
Date (localtime)    = Wed Nov  7 09:46:49 2018
Ensembl API version = 94
---------------------------------------------------
at7 commented 5 years ago

Hi, vep checks if sift is available by reading the info.txt file in the cache directory. Could you please send me the content of that file? The file is under path_to_your_cache_directory/homo_sapiens/92_GRCh37/info.txt Thanks, Anja

cegg commented 5 years ago
$ cat .vep/homo_sapiens/92_GRCh37/info.txt
species homo_sapiens
assembly    GRCh37
sift    b
polyphen    b
source_polyphen 2.2.2
source_sift sift5.2.2
source_genebuild    2011-04
source_gencode  GENCODE 19
source_assembly GRCh37.p13
variation_cols  variation_name,failed,somatic,start,end,allele_string,strand,minor_allele,minor_allele_freq,clin_sig,phenotype_or_disease,pubmed,AFR,AMR,EAS,EUR,SAS,AA,EA,gnomAD,gnomAD_AFR,gnomAD_AMR,gnomAD_ASJ,gnomAD_EAS,gnomAD_FIN,gnomAD_NFE,gnomAD_OTH,gnomAD_SAS
source_COSMIC   81
source_HGMD-PUBLIC  20164
source_ESP  20141103
source_ClinVar  201706
source_dbSNP    150
source_1000genomes  phase3
source_gnomAD   170228
regulatory  1
cell_types  A549,Aorta,B_cells_(PB)_Roadmap,CD14+CD16-_monocyte_(CB),CD14+CD16-_monocyte_(VB),CD4+_ab_T_cell_(VB),CD8+_ab_T_cell_(CB),CM_CD4+_ab_T_cell_(VB),DND-41,EPC_(VB),Fetal_Adrenal_Gland,Fetal_Intestine_Large,Fetal_Intestine_Small,Fetal_Muscle_Leg,Fetal_Muscle_Trunk,Fetal_Stomach,Fetal_Thymus,GM12878,Gastric,H1-mesenchymal,H1-neuronal_progenitor,H1-trophoblast,H1ESC,H9,HMEC,HSMM,HSMMtube,HUVEC,HUVEC_prol_(CB),HeLa-S3,HepG2,IMR90,K562,Left_Ventricle,Lung,M0_macrophage_(CB),M0_macrophage_(VB),M1_macrophage_(CB),M1_macrophage_(VB),M2_macrophage_(CB),M2_macrophage_(VB),MSC_(VB),Monocytes-CD14+,Monocytes-CD14+_(PB)_Roadmap,NH-A,NHDF-AD,NHEK,NHLF,Natural_Killer_cells_(PB),Osteobl,Ovary,Pancreas,Placenta,Psoas_Muscle,Right_Atrium,Small_Intestine,Spleen,T_cells_(PB)_Roadmap,Thymus,eosinophil_(VB),erythroblast_(CB),iPS-20b,iPS_DF_19.11,iPS_DF_6.9,naive_B_cell_(VB),neutrophil_(CB),neutrophil_(VB),neutrophil_myelocyte_(BM)
source_regbuild 1.0
at7 commented 5 years ago

OK. That looks good. Is there a reason why you are running API version with cache version 92? We do recommend to use the same cache version as API version. Can you please try running with --cache_version 92?

cegg commented 5 years ago

No particular reason, just trying not to deviate from instructions in https://gist.github.com/ckandoth/5390e3ae4ecf182fa92f6318cfa9fa97 which uses VER=92 In general, I don't really need VEP, I am installing it as a pre-req for vcf2maf which I am after ( https://github.com/mskcc/vcf2maf/blob/master/README.md ). So to get datasets of version 94 I would need set export VER=94, then download all the files with 94 in the name instead of 92?

at7 commented 5 years ago

Sorry, but we need to take a step back. If you followed the installation instructions why are using the API version 94? The error you get indicates that VEP can read the info file in the cache directory. However this also means that you have cache files for version 94 installed? Otherwise I would expect an error like MSG: ERROR: No cache found for homo_sapiens, version 94. So I would recommend that you check your versions of API and cache files and use the same version for both for your analysis.

cegg commented 5 years ago

I am not sure if I am using API version 94, I think that because of the error message I am getting Ensembl API version = 94 (see above where I started) . Is there a command to get the version? -v or --version do not give me that.

at7 commented 5 years ago

I see. Can you please go into your ensembl-vep directory and type perl INSTALL -help. This should print the version at the top of the output. Thanks

cegg commented 5 years ago

Here it is:

$ perl INSTALL.pl -help
#---------------#
# VEP INSTALLER #
#---------------#

versions
  ensembl              : 94.5c08d90
  ensembl-funcgen      : 94.08b0c13
  ensembl-io           : 94.8d53275
  ensembl-variation    : 94.066b102
  ensembl-vep          : 94.5

BTW, I am downloading datasets v.94 right now, but that will take couple of hours...

at7 commented 5 years ago

It is good that you have version 94 installed. I'm just wondering how you got there if you followed the instructions here: https://gist.github.com/ckandoth/5390e3ae4ecf182fa92f6318cfa9fa97 Is it also possible that you have 94 cache files? Do you have any directories called 94_GRCh37? That should be under path_to_your_cache_directory/homo_sapiens/. If not I would suggest that you manually download ftp://ftp.ensembl.org/pub/release-94/variation/VEP/homo_sapiens_vep_94_GRCh37.tar.gz and run tar xzf homo_sapiens_vep_94_GRCh37.tar.gz After those steps test your script again. If the error disappears you can run the convert_cache script as described in your instructions. Let me know how it goes.

at7 commented 5 years ago

Did by any chance running your command with --cache_version 92 work?

cegg commented 5 years ago

adding --cache_version 92 does not change anything

Now I downloaded 94 cache files and ran again the following:

rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-$VER/variation/VEP/homo_sapiens_vep_$VER\_GRCh37.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-$VER/variation/VEP/homo_sapiens_vep_$VER\_GRCh38.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-$VER/variation/VEP/mus_musculus_vep_$VER\_GRCm38.tar.gz $VEP_DATA
cat $VEP_DATA/*_vep_$VER\_GRC{h37,h38,m38}.tar.gz | tar -izxf - -C $VEP_DATA
perl INSTALL.pl --AUTO a --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA --NO_HTSLIB
perl INSTALL.pl --AUTO f --SPECIES homo_sapiens --ASSEMBLY GRCh37 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO f --SPECIES homo_sapiens --ASSEMBLY GRCh38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO f --SPECIES mus_musculus --ASSEMBLY GRCm38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA

The error changed:

$ ./vep --species mus_musculus --assembly GRCm38 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/mus_musculus/$VER\_GRCm38/Mus_musculus.GRCm38.dna.toplevel.fa.gz --input_file examples/mus_musculus_GRCm38.vcf --output_file examples/mus_musculus_GRCm38.vep.vcf --regulatory

-------------------- EXCEPTION --------------------
MSG: ERROR: Cannot index bgzipped FASTA file with Bio::DB::Fasta

STACK Bio::EnsEMBL::Variation::Utils::FastaSequence::setup_fasta /Users/ipozdnya/ensembl-vep/Bio/EnsEMBL/Variation/Utils/FastaSequence.pm:210
STACK Bio::EnsEMBL::VEP::BaseVEP::fasta_db /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseVEP.pm:475
STACK Bio::EnsEMBL::VEP::Runner::init /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:129
STACK Bio::EnsEMBL::VEP::Runner::run /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel ./vep:224
Date (localtime)    = Thu Nov  8 19:09:09 2018
Ensembl API version = 94
at7 commented 5 years ago

Can you please unzip your FASTA file and just gzip it? I noticed that you switched species to --species mus_musculus. Did you also install the cache files for mouse?

cegg commented 5 years ago
  1. switching to mouse does not matter, yes, the mouse cache file was downloaded and indexed. The test for human (below) gives the same error about cannot index bgzipped FASTA file with Bio::DB::Fasta ./vep --species homo_sapiens --assembly GRCh37 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/$VER\_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file examples/homo_sapiens_GRCh37.vcf --output_file examples/homo_sapiens_GRCh37.vep.vcf --polyphen b --af --af_1kg --af_esp --regulatory
  2. Let me make sure I understood the suggestion correctly: I should unzip and untar homo_sapiens_vep_94_GRCh37.tar.gz then only zip it without tar ? And then re-run tabix on the new file?
  3. I found another suggestion in regard of that cannot index bgzipped FASTA file with Bio::DB::Fasta error: [(https://github.com/mskcc/vcf2maf/issues/97)] . It suggests that a newer version of perl is required but that would be a dead end for me as I can not control perl version in all the environments involved.
at7 commented 5 years ago

It sounds like your FASTA file is bgzipped: Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz I suggest that you unzip it and then just use gzip for compressing the FASTA file again.

cegg commented 5 years ago

bzip error seems to be a manifestation of some other problem, not use of bzip2. Might be that perl version thing after all First, the extension is .gz, not .bzip. Second, upzipping and zipping again did not change anything - same bzip complain Third, checkng the file tells me it's a regular gzipped:

$ file Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz: gzip compressed data, was "Homo_sapiens.GRCh37.75.dna.prim", from Unix, last modified: Wed Nov  7 17:08:27 2018

When I try to run the vf2maf test (which is the goal of my whole activity) with --ref-fasta specified in the command line, there is another error:

$ perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf --ref-fasta Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
`[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip`
[faidx] Could not load fai index of /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 1:11290178 from reference FASTA: /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
..several more similar warnings...
WARNING: Couldn't retrieve bps around 17:7579312 from reference FASTA: /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
STATUS: Running VEP and writing to: tests/test.vep.vcf
ERROR: Cannot find VEP script in path: /Users/ipozdnya/vep

Is it possible that my version of vcf2maf wants fasta to be bzipped while VEP wants it gzipped?

at7 commented 5 years ago

You can use bgzipped FASTA files with VEP if you have Bio::DB::HTS::Faidx installed. The code and installation instructions are here: https://github.com/Ensembl/Bio-DB-HTS/tree/release/v2.11

cegg commented 5 years ago

I figure out the error ERROR: Cannot find VEP script in path: /Users/ipozdnya/vep from vcf2maf test run from the previous post: vcf2maf.pl expects vep executable to be in vep subdir my ( $vep_path, $vep_data, $vep_forks, $buffer_size, $any_allele ) = ( "$ENV{HOME}/ensembl-vep", "$ENV{HOME}/.vep", 4, 5000, 0 ); while on my machine vep gets installed in subdir ensemble-vep. After correcting the script to point to ensembl-vep, I started to get the same error about bzip from vcf2maf.pl run as well:

$ perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf --ref-fasta /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip
[faidx] Could not load fai index of /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 1:11290178 from reference FASTA: /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
..bunch of similar warnings...
WARNING: Couldn't retrieve bps around 17:7579312 from reference FASTA: /Users/ipozdnya/.vep/homo_sapiens/94_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
STATUS: Running VEP and writing to: tests/test.vep.vcf

-------------------- EXCEPTION --------------------
**MSG: ERROR: Cannot index bgzipped FASTA file with Bio::DB::Fasta**

STACK Bio::EnsEMBL::Variation::Utils::FastaSequence::setup_fasta /Users/ipozdnya/ensembl-vep/Bio/EnsEMBL/Variation/Utils/FastaSequence.pm:210
STACK Bio::EnsEMBL::VEP::BaseVEP::fasta_db /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseVEP.pm:475
STACK Bio::EnsEMBL::VEP::Runner::init /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:129
STACK Bio::EnsEMBL::VEP::Runner::run /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel /Users/ipozdnya/ensembl-vep/vep:224
Date (localtime)    = Fri Nov  9 12:46:26 2018
Ensembl API version = 94
---------------------------------------------------

ERROR: Failed to run the VEP annotator!