Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Performance and Error Issues with VEP v111 Docker Container #1681

Open Ananya-swi opened 4 months ago

Ananya-swi commented 4 months ago

Hi,

We are currently using the VEP v111 Docker container to annotate VCF files. However, we are facing the following issues:

  1. Performance issue: Annotating a very small sample VCF file takes an unexpectedly long time, even after varying the buffer size (50 to 10000) and the fork count (8 and 12).
  2. Compute size: 32 vCPUs and 64 GB of RAM.
  3. Error with the --fork option: Using the --fork option on its own results in an error. A screenshot of the error is attached for reference.

[Image: screenshot of the error]

Previously we were using v106.1, which takes only 7 minutes to annotate the same file. We also tried v109.3, which leads to the same error.

We would appreciate any guidance or suggestions to resolve these issues. Looking forward to your reply. Thank you in advance for your assistance.

Regards, Ananya Saji Data Engineer (Bioinformatics) Semantic Web Tech Pvt. Ltd.

nuno-agostinho commented 4 months ago

Hi Ananya,

Hope you are having a nice day. Sorry for the inconvenience.

Could you show the command that you are using to run VEP? Thank you.

Cheers, Nuno

Ananya-swi commented 4 months ago

Hi Nuno,

Thank you for your response.

We conducted testing using ARGO.

VEP Version Used: 111 & 109.3

Because we depend on AKS and Argo, we can effectively use only 5 vCPUs and 25 GB of RAM on nodes with 8 vCPUs and 32 GB of RAM, and 13 vCPUs and 55 GB of RAM on nodes with 16 vCPUs and 64 GB of RAM.
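
These limits can be cross-checked from inside the container, since `nproc` may report the node's CPU count rather than the pod's CPU limit. The following is a minimal sketch, assuming the container uses cgroup v2 and exposes its limit at `/sys/fs/cgroup/cpu.max`; the helper name `cpus_from_quota` is hypothetical.

```shell
# Hypothetical helper: turn a CFS quota/period pair into a whole number of CPUs.
#   $1 = quota in microseconds, or "max" when no limit is set
#   $2 = period in microseconds
#   $3 = CPU count to fall back on when there is no quota
cpus_from_quota() {
  if [ "$1" = "max" ]; then
    echo "$3"
    return 0
  fi
  # Integer division rounds down, so a limit of 5.5 CPUs is reported as 5.
  cpus=$(( $1 / $2 ))
  # Limits below one CPU are reported as 1.
  if [ "$cpus" -lt 1 ]; then
    cpus=1
  fi
  echo "$cpus"
}

# Assumption: cgroup v2 stores the limit as "<quota> <period>" in cpu.max.
if [ -r /sys/fs/cgroup/cpu.max ]; then
  read -r quota period < /sys/fs/cgroup/cpu.max
  echo "usable CPUs: $(cpus_from_quota "$quota" "$period" "$(nproc)")"
fi
```

A pod limited to 5 CPUs (for example, a quota of 500000 with a period of 100000) would report 5, which is a sensible upper bound for the --fork value.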

Here is the command we are using to run VEP:

vep \
--cache --refseq  \
--CACHE_VERSION 109 \
--dir_plugins /opt/vep/.vep/Plugins \
--no_stats \
-i "/home/admin/test/Test.vcf.gz" \
-o "/home/admin/test/Test.txt" \
--symbol --hgvs --hgvsg --variant_class --gene_phenotype \
--flag_pick_allele_gene --canonical --appris --ccds --numbers --total_length --mane \
--sift p --polyphen p \
--fasta  /opt/vep/.vep/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz \
--species homo_sapiens --assembly GRCh37 \
--af --af_gnomad \
--no_escape \
--plugin SpliceAI,snv=/opt/vep/.vep/Grch37/spliceai_scores.raw.snv.hg19.vcf.gz,indel=/opt/vep/.vep/Grch37/spliceai_scores.raw.indel.hg19.vcf.gz \
--plugin NMD \
--dir_plugins /opt/vep/.vep/Plugins \
--plugin dbNSFP,/opt/vep/.vep/Grch37/dbNSFP4.5a_grch37.gz,PROVEAN_pred,LRT_pred,MutationTaster_pred,\
MutationAssessor_pred,FATHMM_pred,fathmm-MKL_coding_pred,M-CAP_pred,fathmm-XF_coding_pred,\
DANN_score,MutPred_score,PrimateAI_pred,Aloft_pred,BayesDel_addAF_pred,LIST-S2_pred,\
MVP_score,Eigen-phred_coding,SiPhy_29way_logOdds,bStatistic,Interpro_domain,MetaLR_pred,\
GTEx_V8_gene,GTEx_V8_tissue,VEST4_score,REVEL_score,AlphaMissense_score \
--offline --tab --fork 5 --force_overwrite ;

In addition, we used nine custom files and the pLI, CADD, and dbscSNV plugins for our annotation.

We encountered the same error with VEP versions 109.3 and 111, whereas VEP version 106.1 completed the annotation in 7 minutes for the same file.

Looking forward to your assistance. Thank you.

Best regards, Ananya

nuno-agostinho commented 4 months ago

Hi @Ananya-swi,

Thanks for sending more information. You seem to be using VEP as expected, so I am not really sure why it is taking so much time.

Some ideas/questions about the performance issues:

Looking forward to your reply.

Best, Nuno

Ananya-swi commented 4 months ago

Hi @nuno-agostinho,

Thanks for your response.

I've tried the method you suggested, removing the plugins, but I am still encountering the same issue. When I use --buffer_size 50 --fork 8, the script runs but takes a long time to complete. The VCF file used as input only contains SNVs.
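
For reference, the size and composition of the input can be checked with a short script. This is a minimal sketch, assuming a standard tab-separated VCF layout (REF in column 4, ALT in column 5); `vcf_summary` is a hypothetical helper name, and the input path is the one used in the commands in this thread.

```shell
# Hypothetical helper: read an uncompressed VCF on stdin and print the number of
# records, plus how many have a single-character REF and only single-character
# ALT alleles (a simple proxy for SNVs; "*" and "." alleles also count).
vcf_summary() {
  awk -F'\t' '
    /^#/ { next }                       # skip meta-information and header lines
    {
      total++
      n = split($5, alts, ",")          # ALT may list several alleles
      snv = (length($4) == 1)
      for (i = 1; i <= n; i++) {
        if (length(alts[i]) != 1) snv = 0
      }
      if (snv) snvs++
    }
    END { printf "records=%d snvs=%d\n", total, snvs }
  '
}

# Run on the input file only if it is present.
if [ -r /home/admin/test/Test.vcf.gz ]; then
  gzip -dc /home/admin/test/Test.vcf.gz | vcf_summary
fi
```

If the two counts are equal, the file contains only records of this simple SNV form.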

Explanation of Using --fork Alone:

I used the VEP command without the --buffer_size flag. Here is the command:

vep \
--cache --refseq  \
--CACHE_VERSION 109 \
--dir_plugins /opt/vep/.vep/Plugins \
--no_stats \
-i "/home/admin/test/Test.vcf.gz" \
-o "/home/admin/test/Test.txt" \
--symbol --hgvs --hgvsg --variant_class --gene_phenotype \
--flag_pick_allele_gene --canonical --appris --ccds --numbers --total_length --mane \
--sift p --polyphen p \
--fasta  /opt/vep/.vep/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz \
--species homo_sapiens --assembly GRCh37 \
--af --af_gnomad \
--no_escape \
--plugin SpliceAI,snv=/opt/vep/.vep/Grch37/spliceai_scores.raw.snv.hg19.vcf.gz,indel=/opt/vep/.vep/Grch37/spliceai_scores.raw.indel.hg19.vcf.gz \
--plugin NMD \
--dir_plugins /opt/vep/.vep/Plugins \
--plugin dbNSFP,/opt/vep/.vep/Grch37/dbNSFP4.5a_grch37.gz,PROVEAN_pred,LRT_pred,MutationTaster_pred,\
MutationAssessor_pred,FATHMM_pred,fathmm-MKL_coding_pred,M-CAP_pred,fathmm-XF_coding_pred,\
DANN_score,MutPred_score,PrimateAI_pred,Aloft_pred,BayesDel_addAF_pred,LIST-S2_pred,\
MVP_score,Eigen-phred_coding,SiPhy_29way_logOdds,bStatistic,Interpro_domain,MetaLR_pred,\
GTEx_V8_gene,GTEx_V8_tissue,VEST4_score,REVEL_score,AlphaMissense_score \
--offline --tab --fork 5 --force_overwrite ;

Using this command, I received the following error:

[Image: screenshot of the error]

Using --buffer_size:

When I added the --buffer_size 50 flag, the script ran but took a long time to execute. Here is the command I used:

vep \
--cache --refseq  \
--CACHE_VERSION 109 \
--dir_plugins /opt/vep/.vep/Plugins \
--no_stats \
-i "/home/admin/test/Test.vcf.gz" \
-o "/home/admin/test/Test.txt" \
--symbol --hgvs --hgvsg --variant_class --gene_phenotype \
--flag_pick_allele_gene --canonical --appris --ccds --numbers --total_length --mane \
--sift p --polyphen p \
--fasta  /opt/vep/.vep/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz \
--species homo_sapiens --assembly GRCh37 \
--af --af_gnomad \
--no_escape \
--plugin SpliceAI,snv=/opt/vep/.vep/Grch37/spliceai_scores.raw.snv.hg19.vcf.gz,indel=/opt/vep/.vep/Grch37/spliceai_scores.raw.indel.hg19.vcf.gz \
--plugin NMD \
--dir_plugins /opt/vep/.vep/Plugins \
--plugin dbNSFP,/opt/vep/.vep/Grch37/dbNSFP4.5a_grch37.gz,PROVEAN_pred,LRT_pred,MutationTaster_pred,\
MutationAssessor_pred,FATHMM_pred,fathmm-MKL_coding_pred,M-CAP_pred,fathmm-XF_coding_pred,\
DANN_score,MutPred_score,PrimateAI_pred,Aloft_pred,BayesDel_addAF_pred,LIST-S2_pred,\
MVP_score,Eigen-phred_coding,SiPhy_29way_logOdds,bStatistic,Interpro_domain,MetaLR_pred,\
GTEx_V8_gene,GTEx_V8_tissue,VEST4_score,REVEL_score,AlphaMissense_score \
--offline --tab --buffer_size 50 --fork 5 --force_overwrite ;

Despite following the suggestions, the issue persists. The script runs with a smaller buffer size but takes a significantly longer time to complete. It appears that higher buffer sizes and fork counts lead to process communication issues.

I tried these commands with VEP versions 111 and 109.3, and the same error occurs. However, when using versions 106.0 or 106.1, it works without any issues.
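
To quantify how the run time changes with these two settings, the combinations could be timed in a loop. This is a rough sketch only: `VEP_ARGS`, `build_cmd`, `run_grid`, and `DRY_RUN` are hypothetical names, the argument list is trimmed from the commands above with the plugin options left out, and the grid values are just examples within the ranges already tried.

```shell
# Base arguments copied from the commands above (plugins omitted for brevity).
VEP_ARGS="--cache --refseq --cache_version 109 --offline --no_stats \
--species homo_sapiens --assembly GRCh37 \
-i /home/admin/test/Test.vcf.gz --tab --force_overwrite"

# Print the full command line for one combination.
#   $1 = buffer size, $2 = fork count
build_cmd() {
  echo "vep $VEP_ARGS -o /home/admin/test/Test.b$1.f$2.txt --buffer_size $1 --fork $2"
}

# Walk the grid. With DRY_RUN=1 (the default here) the commands are only printed;
# with DRY_RUN=0 each command is executed and its wall-clock time is reported.
run_grid() {
  for buffer in 50 500 5000; do
    for fork in 5 8; do
      cmd=$(build_cmd "$buffer" "$fork")
      if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
      else
        start=$(date +%s)
        eval "$cmd"
        end=$(date +%s)
        echo "buffer_size=$buffer fork=$fork: $(( end - start ))s"
      fi
    done
  done
}

run_grid
```

Running the same grid under v106.1 and v111 would show whether the slowdown depends on the buffer size, the fork count, or neither.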

Could you provide further insights or additional configurations that might help resolve this problem?

I look forward to your guidance on this issue.

Thanks, Ananya