Closed Stikus closed 4 years ago
It looks like I found the cause of this issue.
According to the VEP Cache page (http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#cache), all indexed caches are here:
ftp://ftp.ensembl.org/pub/release-98/variation/indexed_vep_cache/
And all non-indexed caches are here:
ftp://ftp.ensembl.org/pub/release-98/variation/vep/
But the default FTP cache URL for VEP, set on this line of INSTALL.pl (https://github.com/Ensembl/ensembl-vep/blob/release/98/INSTALL.pl#L212), is:
$CACHE_URL ||= "ftp://ftp.ensembl.org/pub/release-$DATA_VERSION/variation/vep";
And it points to the non-indexed cache.
Am I correct?
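The two locations differ only in the last path component. A minimal sketch of how the two URLs are built for a given release, assuming the directory layout quoted above also holds for other releases (vep_cache_url is a made-up helper name and is not part of INSTALL.pl):

```shell
# vep_cache_url is a hypothetical helper, not part of INSTALL.pl.
# Usage: vep_cache_url RELEASE [indexed|plain]
vep_cache_url() {
  local release="$1"
  local kind="${2:-indexed}"
  local base="ftp://ftp.ensembl.org/pub/release-${release}/variation"
  case "$kind" in
    indexed) printf '%s\n' "${base}/indexed_vep_cache/" ;;  # tabix-indexed caches
    plain)   printf '%s\n' "${base}/vep/" ;;                # non-indexed caches
    *)       printf 'unknown cache kind: %s\n' "$kind" >&2; return 1 ;;
  esac
}
```

If your copy of INSTALL.pl accepts a cache URL override (believed to be --CACHEURL; confirm with the installer's usage text), passing the indexed URL there may work as a stopgap until the default is fixed.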
By the way, about the info files: according to the Cache page GRCh37 has dbSNP v152, but according to my info.txt above it has dbSNP v151. And for GRCh38 my info.txt says:
# CACHE UPDATED 2019-09-30 14:47:57
species homo_sapiens
assembly GRCh38
sift b
polyphen b
source_polyphen 2.2.2
source_sift sift5.2.2
source_genebuild 2014-07
source_gencode GENCODE 32
source_assembly GRCh38.p13
variation_cols chr,variation_name,failed,somatic,start,end,allele_string,strand,minor_allele,minor_allele_freq,clin_sig,phenotype_or_disease,clin_sig_allele,pubmed,AFR,AMR,EAS,EUR,SAS,AA,EA,gnomAD,gnomAD_AFR,gnomAD_AMR,gnomAD_ASJ,gnomAD_EAS,gnomAD_FIN,gnomAD_NFE,gnomAD_OTH,gnomAD_SAS
source_COSMIC 89
source_HGMD-PUBLIC 20184
source_ESP 20141103
source_ClinVar 201907
source_dbSNP 152
source_1000genomes phase3
source_ESP V2-SSA137
source_gnomAD r2.1
regulatory 1
cell_types A549,A673,B,B_(PB),CD14+_monocyte_(PB),CD14+_monocyte_1,CD4+_CD25+_ab_Treg_(PB),CD4+_ab_T,CD4+_ab_T_(PB)_1,CD4+_ab_T_(PB)_2,CD4+_ab_T_(Th),CD4+_ab_T_(VB),CD8+_ab_T_(CB),CD8+_ab_T_(PB),CMP_CD4+_1,CMP_CD4+_2,CMP_CD4+_3,CM_CD4+_ab_T_(VB),DND-41,EB_(CB),EM_CD4+_ab_T_(PB),EM_CD8+_ab_T_(VB),EPC_(VB),GM12878,H1-hESC_2,H1-hESC_3,H9_1,HCT116,HSMM,HUES48,HUES6,HUES64,HUVEC,HUVEC-prol_(CB),HeLa-S3,HepG2,K562,M0_(CB),M0_(VB),M1_(CB),M1_(VB),M2_(CB),M2_(VB),MCF-7,MM.1S,MSC,MSC_(VB),NHLF,NK_(PB),NPC_1,NPC_2,NPC_3,PC-3,PC-9,SK-N.,T_(PB),Th17,UCSF-4,adrenal_gland,aorta,astrocyte,bipolar_neuron,brain_1,cardiac_muscle,dermal_fibroblast,endodermal,eosinophil_(VB),esophagus,foreskin_fibroblast_2,foreskin_keratinocyte_1,foreskin_keratinocyte_2,foreskin_melanocyte_1,foreskin_melanocyte_2,germinal_matrix,heart,hepatocyte,iPS-15b,iPS-20b,iPS_DF_19.11,iPS_DF_6.9,keratinocyte,kidney,large_intestine,left_ventricle,leg_muscle,lung_1,lung_2,mammary_epithelial_1,mammary_epithelial_2,mammary_myoepithelial,monocyte_(CB),monocyte_(VB),mononuclear_(PB),myotube,naive_B_(VB),neuron,neurosphere_(C),neurosphere_(GE),neutro_myelocyte,neutrophil_(CB),neutrophil_(VB),osteoblast,ovary,pancreas,placenta,psoas_muscle,right_atrium,right_ventricle,sigmoid_colon,small_intestine_1,small_intestine_2,spleen,stomach_1,stomach_2,thymus_1,thymus_2,trophoblast,trunk_muscle
source_regbuild 1.0
var_type tabix
So info.txt says source_gencode GENCODE 32 and source_assembly GRCh38.p13, but the Cache page says genome assembly GRCh38.p12 and GENCODE 31. It looks like something is wrong somewhere.
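A comparison like this can be scripted rather than read by eye. The sketch below pulls a single key out of an info.txt file, assuming each line is a key followed by whitespace and then the value, as in the dump above; info_value is a hypothetical helper name, not something VEP ships.

```shell
# info_value is a hypothetical helper: print the value(s) stored under KEY
# in an info.txt file. Assumes "key<whitespace>value" lines.
# Usage: info_value FILE KEY
info_value() {
  awk -v key="$2" '
    $1 == key {
      sub(/^[^ \t]+[ \t]+/, "")   # drop the key and the separator, keep the rest of the line
      print
    }
  ' "$1"
}
```

Keys that occur more than once, such as source_ESP in the dump above, print one line per occurrence. The printed values can then be compared against what the Cache page lists for the same release.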
Hi,
Thank you for this report. You’re correct, the changes that we made to make the VEP installer default to using the indexed cache files are missing from release 98, and they shouldn’t be. I’ll let you know when a fix is in place.
And as you’ve also found, you can download the indexed cache file directly from ftp://ftp.ensembl.org/pub/release-98/variation/indexed_vep_cache/ in the meantime.
Kind Regards, Andrew
Thank you for the answer and the fast update.
But what about the --CONVERT option? It is still present in INSTALL.pl, but now it has no effect at all.
Should it be removed, maybe?
Hi,
Thanks for reporting this issue to us. The installer has now been updated and will correctly install the indexed versions of the cache files.
Regarding --CONVERT, while we've switched to installing tabix-indexed cache files by default, the installer can still be used to install non-indexed cache files. We also want to provide users with a mechanism for converting old cache files if they choose, so we don't want to remove the flag entirely.
Kind Regards, Andrew
But you removed this code block and this line respectively:
system("perl $dirname/convert_cache.pl --dir $CACHE_DIR --species $species --version $DATA_VERSION\_$assembly --bgzip $bgzip --tabix $tabix") == 0 or print STDERR "WARNING: Failed to run convert script\n";
In the current (98.2) release there are only 2 occurrences of this option:
line 72:
$CONVERT,
line 167:
'CONVERT|t' => \$CONVERT,
And in the usage text.
So for now --CONVERT does nothing according to the code, if I'm not mistaken.
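One way to double-check this is to list every line of INSTALL.pl that mentions the option. A minimal sketch, assuming a local checkout of the release in question; option_refs is a hypothetical helper, and the name is matched as a plain grep pattern, so the usage text shows up as well:

```shell
# option_refs is a hypothetical helper: list the lines of FILE that mention NAME,
# each prefixed with its line number.
# Usage: option_refs FILE NAME
option_refs() {
  grep -n -- "$2" "$1"
}
```

Running `option_refs INSTALL.pl CONVERT` should show whether anything besides the variable declaration, the option definition and the usage text still refers to the flag.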
Hi,
Thanks for this report. You're correct, I've submitted a PR to adjust this.
Kind Regards, Andrew
Thanks for the answers and for your work.
Hello. If I understand the VEP 97 release notes correctly, cache files are now already tabix-indexed and the cache conversion step is no longer needed. Release notes VEP 97
Related issue
But when I updated our VEP installation to version 98 (this problem occurred with version 97 too, by the way), the conversion still started and took several hours for both the GRCh37 and GRCh38 genomes.
Context: Our VEP installation command:
perl INSTALL.pl --NO_TEST --NO_UPDATE --NO_HTSLIB --NO_BIOPERL -a ap --PLUGINSDIR "$SOFT/ensembl-vep-${VEP_VERSION}/Plugins" --PLUGINS ProteinSeqs,Downstream,Conservation,GO,G2P
Our VEP cache install command:
$VEPINSTALL --NO_TEST --NO_UPDATE --NO_HTSLIB --NO_BIOPERL -a cf -s homo_sapiens -y $vepCacheAssembly --CONVERT -c $VEPCACHEDIR
If I understand the Perl script correctly, this line starts convert_cache.pl and this line checks whether conversion is needed. So if info.txt contains the key:
var_type tabix
then the variant conversion is skipped. But these are our 98_GRCh37 info.txt files before and after conversion:
Before:
After:
It looks like info.txt doesn't record that the files are already tabix-indexed. Did I misunderstand something, or is something wrong here?
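The check described above can be reproduced from the shell before starting a long install. This is only a sketch of the idea as this post understands it, not the logic convert_cache.pl actually runs; cache_is_tabix is a hypothetical helper, and it assumes the key and value are separated by whitespace.

```shell
# cache_is_tabix is a hypothetical helper: exit status 0 if the given info.txt
# declares "var_type tabix", non-zero otherwise.
# Usage: cache_is_tabix FILE
cache_is_tabix() {
  grep -Eq '^var_type[[:space:]]+tabix[[:space:]]*$' "$1"
}
```

For example, `cache_is_tabix "$VEPCACHEDIR/homo_sapiens/98_GRCh37/info.txt" && echo "already indexed"`, assuming the usual species/version_assembly layout under the cache directory. If the key is missing from a freshly downloaded cache, that would be consistent with the conversion being triggered.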