Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Cache conversion for 98 release #610

Closed: Stikus closed this issue 4 years ago

Stikus commented 4 years ago

Hello. If I understand the VEP 97 release notes correctly, the cache files are now already tabix-indexed, so the cache conversion step is no longer needed. From the VEP 97 release notes:

tabix-indexed variant cache files are now installed by default.

Related issue

But when I updated our VEP installation to version 98 (the same problem occurred with version 97, by the way), the conversion still started and took several hours for both the GRCh37 and GRCh38 genomes.

Context. Our VEP installation command:

perl INSTALL.pl --NO_TEST --NO_UPDATE --NO_HTSLIB --NO_BIOPERL -a ap --PLUGINSDIR "$SOFT/ensembl-vep-${VEP_VERSION}/Plugins" --PLUGINS ProteinSeqs,Downstream,Conservation,GO,G2P

Our VEP cache install command:

$VEPINSTALL --NO_TEST --NO_UPDATE --NO_HTSLIB --NO_BIOPERL -a cf -s homo_sapiens -y $vepCacheAssembly --CONVERT -c $VEPCACHEDIR

If I understand the Perl script correctly, this line starts convert_cache.pl and this line checks whether conversion is needed. So if info.txt contains the key var_type with the value tabix, conversion of the variation cache is skipped.
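The check described above can be sketched in shell. This is only an illustration of the idea, not the installer's Perl code: the function name is_tabix_cache is made up here, and the demo file stands in for a real info.txt.

```shell
#!/bin/sh
# Report whether a VEP cache info.txt declares a tabix-indexed variation
# cache, i.e. contains a "var_type tabix" line.
# (is_tabix_cache is a hypothetical helper, not part of VEP.)
is_tabix_cache() {
    # $1: path to info.txt. Exit status 0 if the key/value pair is present.
    grep -Eq '^var_type[[:space:]]+tabix[[:space:]]*$' "$1"
}

# Demo on a throwaway file rather than a real cache directory.
demo=$(mktemp)
printf 'species\thomo_sapiens\nvar_type\ttabix\n' > "$demo"
if is_tabix_cache "$demo"; then
    echo "marked as tabix-indexed"
else
    echo "not marked as tabix-indexed"
fi
rm -f "$demo"
```

Run against the info.txt in a real cache directory, the same function tells you which of the two cases an installed cache falls into.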

But here is our 98_GRCh37 info.txt before and after conversion. Before:

species homo_sapiens
assembly    GRCh37
sift    b
polyphen    b
source_polyphen 2.2.2
source_sift sift5.2.2
source_genebuild    2011-04
source_gencode  GENCODE 19
source_assembly GRCh37.p13
variation_cols  variation_name,failed,somatic,start,end,allele_string,strand,minor_allele,minor_allele_freq,clin_sig,phenotype_or_disease,clin_sig_allele,pubmed,AFR,AMR,EAS,EUR,SAS,AA,EA,gnomAD,gnomAD_AFR,gnomAD_AMR,gnomAD_ASJ,gnomAD_EAS,gnomAD_FIN,gnomAD_NFE,gnomAD_OTH,gnomAD_SAS
source_COSMIC   86
source_HGMD-PUBLIC  20174
source_ESP  20141103
source_ClinVar  201810
source_dbSNP    151
source_1000genomes  phase3
source_gnomAD   r2.1
regulatory  1
cell_types  A549,A549_ENCSR797CXN,A673,Aorta,B_cell_ENCSR682AXR,B_cells_(PB)_Roadmap,CD14+CD16-_monocyte_(CB),CD14+CD16-_monocyte_(VB),CD14_positive_monocyte,CD38-_naive_B_cell_(VB),CD4+_ab_T_cell_(CB),CD4+_ab_T_cell_(VB),CD4_positive_CD25_positive_alpha_beta_regulatory_T_cell,CD4_positive_alpha_beta_T_cell,CD4_positive_alpha_beta_T_cell_ENCSR948ZKZ,CD4_positive_alpha_beta_memory_T_cell,CD8+_ab_T_cell_(CB),CM_CD4+_ab_T_cell_(VB),DND-41,EM_CD8+_ab_T_cell_(VB),EPC_(VB),Fetal_Adrenal_Gland,Fetal_Intestine_Large,Fetal_Intestine_Small,Fetal_Muscle_Leg,Fetal_Muscle_Trunk,Fetal_Stomach,Fetal_Thymus,GM12878,GM12878_ENCSR447YYN,Gastric,H1-mesenchymal,H1-neuronal_progenitor,H1-trophoblast,H1ESC,H1_hESC,H1_hESC_ENCSR820QMS,H9,H9_ENCSR323FKB,HCT116,HMEC,HSMM,HSMMtube,HUES48,HUES6,HUES64,HUVEC,HUVEC_prol_(CB),HepG2,IMR90,IMR_90,K562,Karpas_422,Left_Ventricle,Lung,M0_macrophage_(CB),M0_macrophage_(VB),M1_macrophage_(CB),M1_macrophage_(VB),M2_macrophage_(CB),M2_macrophage_(VB),MCF_7,MM_1S,MSC_(VB),Monocytes-CD14+,Monocytes-CD14+_(PB)_Roadmap,NH-A,NHDF-AD,NHEK,NHLF,Natural_Killer_cells_(PB),Osteobl,Ovary,PC_3,PC_9,Pancreas,Placenta,Psoas_Muscle,Right_Atrium,SK_N_SH,Small_Intestine,Spleen,T_cells_(PB)_Roadmap,T_helper_17_cell,Thymus,astrocyte,bipolar_neuron,brain,cardiac_muscle_cell,common_myeloid_progenitor_CD34_positive,common_myeloid_progenitor_CD34_positive_ENCSR337XXD_1,common_myeloid_progenitor_CD34_positive_ENCSR722JRY,effector_memory_CD4_positive_alpha_beta_T_cell,endodermal_cell,endothelial_cell_of_umbilical_vein,eosinophil_(VB),erythroblast_(CB),esophagus,fibroblast_of_dermis,fibroblast_of_lung,heart,heart_right_ventricle,hepatocyte,iPS_15b,iPS_DF_19.11,iPS_DF_6.9,keratinocyte,kidney,lung_ENCSR465WKM,mammary_epithelial_cell,myotube,naive_B_cell_(To),naive_B_cell_(VB),naive_thymus_derived_CD4_positive_alpha_beta_T_cell,neural_progenitor_cell,neural_stem_progenitor_cell,neuron,neutrophil,neutrophil_(CB),neutrophil_(VB),neutrophil_myelocyte_(BM),sigmoid_colon,skeletal_muscle_myoblast
source_regbuild 1.0

After:

# CACHE UPDATED 2019-09-30 14:07:50
species homo_sapiens
assembly    GRCh37
sift    b
polyphen    b
source_polyphen 2.2.2
source_sift sift5.2.2
source_genebuild    2011-04
source_gencode  GENCODE 19
source_assembly GRCh37.p13
variation_cols  chr,variation_name,failed,somatic,start,end,allele_string,strand,minor_allele,minor_allele_freq,clin_sig,phenotype_or_disease,clin_sig_allele,pubmed,AFR,AMR,EAS,EUR,SAS,AA,EA,gnomAD,gnomAD_AFR,gnomAD_AMR,gnomAD_ASJ,gnomAD_EAS,gnomAD_FIN,gnomAD_NFE,gnomAD_OTH,gnomAD_SAS
source_COSMIC   86
source_HGMD-PUBLIC  20174
source_ESP  20141103
source_ClinVar  201810
source_dbSNP    151
source_1000genomes  phase3
source_gnomAD   r2.1
regulatory  1
cell_types  A549,A549_ENCSR797CXN,A673,Aorta,B_cell_ENCSR682AXR,B_cells_(PB)_Roadmap,CD14+CD16-_monocyte_(CB),CD14+CD16-_monocyte_(VB),CD14_positive_monocyte,CD38-_naive_B_cell_(VB),CD4+_ab_T_cell_(CB),CD4+_ab_T_cell_(VB),CD4_positive_CD25_positive_alpha_beta_regulatory_T_cell,CD4_positive_alpha_beta_T_cell,CD4_positive_alpha_beta_T_cell_ENCSR948ZKZ,CD4_positive_alpha_beta_memory_T_cell,CD8+_ab_T_cell_(CB),CM_CD4+_ab_T_cell_(VB),DND-41,EM_CD8+_ab_T_cell_(VB),EPC_(VB),Fetal_Adrenal_Gland,Fetal_Intestine_Large,Fetal_Intestine_Small,Fetal_Muscle_Leg,Fetal_Muscle_Trunk,Fetal_Stomach,Fetal_Thymus,GM12878,GM12878_ENCSR447YYN,Gastric,H1-mesenchymal,H1-neuronal_progenitor,H1-trophoblast,H1ESC,H1_hESC,H1_hESC_ENCSR820QMS,H9,H9_ENCSR323FKB,HCT116,HMEC,HSMM,HSMMtube,HUES48,HUES6,HUES64,HUVEC,HUVEC_prol_(CB),HepG2,IMR90,IMR_90,K562,Karpas_422,Left_Ventricle,Lung,M0_macrophage_(CB),M0_macrophage_(VB),M1_macrophage_(CB),M1_macrophage_(VB),M2_macrophage_(CB),M2_macrophage_(VB),MCF_7,MM_1S,MSC_(VB),Monocytes-CD14+,Monocytes-CD14+_(PB)_Roadmap,NH-A,NHDF-AD,NHEK,NHLF,Natural_Killer_cells_(PB),Osteobl,Ovary,PC_3,PC_9,Pancreas,Placenta,Psoas_Muscle,Right_Atrium,SK_N_SH,Small_Intestine,Spleen,T_cells_(PB)_Roadmap,T_helper_17_cell,Thymus,astrocyte,bipolar_neuron,brain,cardiac_muscle_cell,common_myeloid_progenitor_CD34_positive,common_myeloid_progenitor_CD34_positive_ENCSR337XXD_1,common_myeloid_progenitor_CD34_positive_ENCSR722JRY,effector_memory_CD4_positive_alpha_beta_T_cell,endodermal_cell,endothelial_cell_of_umbilical_vein,eosinophil_(VB),erythroblast_(CB),esophagus,fibroblast_of_dermis,fibroblast_of_lung,heart,heart_right_ventricle,hepatocyte,iPS_15b,iPS_DF_19.11,iPS_DF_6.9,keratinocyte,kidney,lung_ENCSR465WKM,mammary_epithelial_cell,myotube,naive_B_cell_(To),naive_B_cell_(VB),naive_thymus_derived_CD4_positive_alpha_beta_T_cell,neural_progenitor_cell,neural_stem_progenitor_cell,neuron,neutrophil,neutrophil_(CB),neutrophil_(VB),neutrophil_myelocyte_(BM),sigmoid_colon,skeletal_muscle_myoblast
source_regbuild 1.0
var_type    tabix

It looks like the downloaded info.txt does not record that the files are already tabix-indexed. Did I misunderstand something, or is this wrong?

Stikus commented 4 years ago

It looks like I found the cause of this issue. According to the VEP Cache page (http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#cache), all indexed caches are here: ftp://ftp.ensembl.org/pub/release-98/variation/indexed_vep_cache/ and all non-indexed caches are here: ftp://ftp.ensembl.org/pub/release-98/variation/vep/

But the default FTP cache URL for VEP, set on this line (https://github.com/Ensembl/ensembl-vep/blob/release/98/INSTALL.pl#L212), is: $CACHE_URL ||= "ftp://ftp.ensembl.org/pub/release-$DATA_VERSION/variation/vep"; so it points to the non-indexed cache. Am I correct?
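To make the difference concrete, here is a small sketch that builds both directories for a given release. The path layout is copied from the two FTP URLs quoted in this comment; the function names are invented for illustration, and the printed installer command assumes a --CACHEURL override exists, which should be checked against the INSTALL.pl usage text before relying on it.

```shell
#!/bin/sh
# Directory layout copied from the two FTP URLs quoted in this thread.
# (default_cache_url and indexed_cache_url are hypothetical helper names.)
default_cache_url() {
    # $1: Ensembl data release number, e.g. 98
    echo "ftp://ftp.ensembl.org/pub/release-$1/variation/vep"
}

indexed_cache_url() {
    # $1: Ensembl data release number, e.g. 98
    echo "ftp://ftp.ensembl.org/pub/release-$1/variation/indexed_vep_cache"
}

release=98
echo "installer default: $(default_cache_url "$release")"
echo "indexed caches:    $(indexed_cache_url "$release")"

# Possible workaround, printed rather than executed.
# ASSUMPTION: the installer accepts --CACHEURL to override $CACHE_URL.
echo "perl INSTALL.pl -a cf -s homo_sapiens -y GRCh37 --CACHEURL $(indexed_cache_url "$release")"
```

The script only prints strings, so it is safe to run without network access.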

By the way, about the info files: according to the Cache page, GRCh37 has dbSNP v152, but according to my info.txt above, it has dbSNP v151. And for GRCh38, my info.txt says:

# CACHE UPDATED 2019-09-30 14:47:57
species homo_sapiens
assembly    GRCh38
sift    b
polyphen    b
source_polyphen 2.2.2
source_sift sift5.2.2
source_genebuild    2014-07
source_gencode  GENCODE 32
source_assembly GRCh38.p13
variation_cols  chr,variation_name,failed,somatic,start,end,allele_string,strand,minor_allele,minor_allele_freq,clin_sig,phenotype_or_disease,clin_sig_allele,pubmed,AFR,AMR,EAS,EUR,SAS,AA,EA,gnomAD,gnomAD_AFR,gnomAD_AMR,gnomAD_ASJ,gnomAD_EAS,gnomAD_FIN,gnomAD_NFE,gnomAD_OTH,gnomAD_SAS
source_COSMIC   89
source_HGMD-PUBLIC  20184
source_ESP  20141103
source_ClinVar  201907
source_dbSNP    152
source_1000genomes  phase3
source_ESP  V2-SSA137
source_gnomAD   r2.1
regulatory  1
cell_types  A549,A673,B,B_(PB),CD14+_monocyte_(PB),CD14+_monocyte_1,CD4+_CD25+_ab_Treg_(PB),CD4+_ab_T,CD4+_ab_T_(PB)_1,CD4+_ab_T_(PB)_2,CD4+_ab_T_(Th),CD4+_ab_T_(VB),CD8+_ab_T_(CB),CD8+_ab_T_(PB),CMP_CD4+_1,CMP_CD4+_2,CMP_CD4+_3,CM_CD4+_ab_T_(VB),DND-41,EB_(CB),EM_CD4+_ab_T_(PB),EM_CD8+_ab_T_(VB),EPC_(VB),GM12878,H1-hESC_2,H1-hESC_3,H9_1,HCT116,HSMM,HUES48,HUES6,HUES64,HUVEC,HUVEC-prol_(CB),HeLa-S3,HepG2,K562,M0_(CB),M0_(VB),M1_(CB),M1_(VB),M2_(CB),M2_(VB),MCF-7,MM.1S,MSC,MSC_(VB),NHLF,NK_(PB),NPC_1,NPC_2,NPC_3,PC-3,PC-9,SK-N.,T_(PB),Th17,UCSF-4,adrenal_gland,aorta,astrocyte,bipolar_neuron,brain_1,cardiac_muscle,dermal_fibroblast,endodermal,eosinophil_(VB),esophagus,foreskin_fibroblast_2,foreskin_keratinocyte_1,foreskin_keratinocyte_2,foreskin_melanocyte_1,foreskin_melanocyte_2,germinal_matrix,heart,hepatocyte,iPS-15b,iPS-20b,iPS_DF_19.11,iPS_DF_6.9,keratinocyte,kidney,large_intestine,left_ventricle,leg_muscle,lung_1,lung_2,mammary_epithelial_1,mammary_epithelial_2,mammary_myoepithelial,monocyte_(CB),monocyte_(VB),mononuclear_(PB),myotube,naive_B_(VB),neuron,neurosphere_(C),neurosphere_(GE),neutro_myelocyte,neutrophil_(CB),neutrophil_(VB),osteoblast,ovary,pancreas,placenta,psoas_muscle,right_atrium,right_ventricle,sigmoid_colon,small_intestine_1,small_intestine_2,spleen,stomach_1,stomach_2,thymus_1,thymus_2,trophoblast,trunk_muscle
source_regbuild 1.0
var_type    tabix

So it reports source_gencode GENCODE 32 and source_assembly GRCh38.p13, but the Cache page says genome assembly GRCh38.p12 and GENCODE 31. It looks like something is wrong somewhere.
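For comparing an installed cache against the Cache page, the relevant values can be pulled out of info.txt with a few lines of shell. This is a minimal sketch: info_value is a made-up helper name, and the demo file only reproduces a few lines from the GRCh38 listing above.

```shell
#!/bin/sh
# Print the value recorded for one key in a VEP cache info.txt.
# Comment lines start with '#', so their first field never equals a real key.
# If a key appears more than once (source_ESP does in the GRCh38 listing),
# every value is printed, one per line.
# (info_value is a hypothetical helper, not part of VEP.)
info_value() {
    # $1: key, $2: path to info.txt
    awk -v key="$1" '$1 == key { $1 = ""; sub(/^ /, ""); print }' "$2"
}

# Demo file with a few lines taken from the GRCh38 info.txt in this thread.
demo=$(mktemp)
printf '# CACHE UPDATED 2019-09-30 14:47:57\n' > "$demo"
printf 'source_gencode\tGENCODE 32\n' >> "$demo"
printf 'source_assembly\tGRCh38.p13\n' >> "$demo"
printf 'source_dbSNP\t152\n' >> "$demo"

for key in source_assembly source_gencode source_dbSNP; do
    printf '%s: %s\n' "$key" "$(info_value "$key" "$demo")"
done
rm -f "$demo"
```

Note that awk rebuilds the line with single spaces, so runs of whitespace inside a value are collapsed; that is harmless for the version strings compared here.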

aparton commented 4 years ago

Hi,

Thank you for this report. You’re correct, the changes that we made to make the VEP installer default to using the indexed cache files are missing from release 98, and they shouldn’t be. I’ll let you know when a fix is in place.

And as you’ve also found, you can download the indexed cache files directly from ftp://ftp.ensembl.org/pub/release-98/variation/indexed_vep_cache/ in the meantime.

Kind Regards, Andrew

Stikus commented 4 years ago

Thank you for the answer and the fast update. But what about the --CONVERT option? It is still present in INSTALL.pl, but now it has no effect at all. Should it perhaps be removed?

aparton commented 4 years ago

Hi,

Thanks for reporting this issue to us, the installer has now been updated and will correctly install the indexed versions of the cache files.

Regarding --CONVERT: while we've switched to installing tabix-indexed cache files by default, the installer can still be used to install non-indexed cache files. We also want to provide users with a mechanism for converting old cache files if they choose, so we don't want to remove the flag entirely.

Kind Regards, Andrew

Stikus commented 4 years ago

But you removed this code block and this line, respectively:

system("perl $dirname/convert_cache.pl --dir $CACHE_DIR --species $species --version $DATA_VERSION\_$assembly --bgzip $bgzip --tabix $tabix") == 0 or print STDERR "WARNING: Failed to run convert script\n";

In the current (98.2) release there are only two occurrences of this option in the code: line 72: $CONVERT, and line 167: 'CONVERT|t' => \$CONVERT, plus the mention in the usage text.

So for now, according to the code, --CONVERT does nothing, if I'm not mistaken.
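A quick way to verify a claim like this is to list every line that mentions the option. The sketch below does that with grep; option_mentions is a made-up helper name, and the three demo lines are a hypothetical stand-in for the real INSTALL.pl, not a copy of it.

```shell
#!/bin/sh
# List every line of a script that mentions a given string, with line
# numbers, to see whether an option is ever acted upon after being parsed.
# (option_mentions is a hypothetical helper name.)
option_mentions() {
    # $1: fixed string to look for, $2: file to scan
    grep -nF -- "$1" "$2"
}

# Hypothetical stand-in for the installer: declaration, option parsing and
# usage text, but no code that reads the variable afterwards.
demo=$(mktemp)
cat > "$demo" <<'EOF'
my $CONVERT;
GetOptions('CONVERT|t' => \$CONVERT);
print "--CONVERT  Convert downloaded caches\n";
EOF

option_mentions 'CONVERT' "$demo"
rm -f "$demo"
```

If the only hits are the declaration, the option parsing and the usage text, nothing in the script acts on the flag.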

aparton commented 4 years ago

Hi,

Thanks for this report. You're correct, I've submitted a PR to adjust this.

Kind Regards, Andrew

Stikus commented 4 years ago

Thanks for the answers and for your work.