Open Ananya-swi opened 1 month ago
Hi @Ananya-swi, thanks for reaching out to us.
It would be useful to see the full VEP command you are using, so we can try to identify potential speed-ups. Even without that, one option I can suggest is our Nextflow VEP pipeline, which offers a degree of parallelisation to speed up processing of large datasets.
I should say that we haven't tested it on cloud compute yet - so if you do decide to try it, let us know if you encounter any challenges.
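To illustrate the general idea of parallelising by splitting the input, here is a minimal sketch that divides VCF lines into one chunk per chromosome, each carrying the full header, so the chunks could be annotated independently and merged afterwards. This is an assumption-laden toy and not the Nextflow VEP implementation: the function name `split_by_chromosome` is hypothetical, and it holds everything in memory, which would not suit a multi-gigabyte file.

```python
from typing import Dict, Iterable, List


def split_by_chromosome(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group VCF records by chromosome; every chunk repeats the header lines.

    Hypothetical helper for illustration only (not part of VEP or Nextflow VEP).
    """
    header: List[str] = []
    records: Dict[str, List[str]] = {}
    for line in lines:
        if line.startswith("#"):
            # Meta-information and column header lines belong to every chunk.
            header.append(line)
        elif line.strip():
            # The first tab-separated field of a record is the chromosome.
            chrom = line.split("\t", 1)[0]
            records.setdefault(chrom, []).append(line)
    return {chrom: header + recs for chrom, recs in records.items()}


# Small made-up example input.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t100\t.\tA\tG\n",
    "1\t200\t.\tC\tT\n",
    "2\t50\t.\tG\tA\n",
]
chunks = split_by_chromosome(example)
print(sorted(chunks))
```

Each chunk is a valid standalone VCF fragment, which is what makes it possible to run one annotation job per chunk.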
Hi @jamie-m-a,
Thank you for the recommendation! I’m sharing the full VEP command I used below for your reference:
docker run -i -v /data:/opt/vep/.vep ensemblorg/ensembl-vep:release_106.1 vep --cache --refseq --CACHE_VERSION 106 --dir_plugins /opt/vep/.vep/Plugins --no_stats -i input.vcf -o output.txt --symbol --hgvs --hgvsg --variant_class --gene_phenotype --flag_pick_allele_gene --canonical --appris --ccds --numbers --total_length --mane --sift p --polyphen p --fasta /opt/vep/.vep/homo_sapiens_refseq/106_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz --species homo_sapiens --assembly GRCh37 --af --af_gnomad --no_escape --plugin NMD --dir_plugins /opt/vep/.vep/Plugins --tab --offline --buffer_size 100000 --fork 32 --force_overwrite
I’ll also explore the Nextflow VEP option to see if it speeds up the annotation process. If any issues arise on the cloud platform, I’ll follow up accordingly.
Best regards, Ananya
Hi @Ananya-swi
No problem! Now that I can see your command, I notice you're not using forks, which can have a significant speed impact. Some general instructions for speeding up Ensembl VEP can be found here.
Let us know how you get on.
Hi @jamie-m-a,
Thank you for the feedback! I wanted to clarify that I did use the --fork option, setting it to 32. However, I still observed long runtimes, with the process taking around 15 hours for a 1.8GB input VCF.
It would be great if you could share any additional insights or optimization tips, particularly regarding other parameters that could improve performance. I’ll also explore the general recommendations provided in the link you shared.
Looking forward to hearing from you!
Thanks again, Ananya
Apologies @Ananya-swi - I missed the fork flag. The other easy thing to check is whether your input VCF is properly sorted. Your run time does seem long for a file that size. Can you advise how many variants are in your input?
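For anyone wanting to run the sort check programmatically, a rough sketch follows. It assumes "properly sorted" means that records for each chromosome are contiguous and that positions ascend within each chromosome; the function name `vcf_is_sorted` is hypothetical and not part of VEP.

```python
from typing import Iterable


def vcf_is_sorted(lines: Iterable[str]) -> bool:
    """Return True if records are grouped by chromosome with ascending positions.

    Hypothetical helper for illustration; header and blank lines are skipped.
    """
    seen = set()          # chromosomes already finished or in progress
    current_chrom = None  # chromosome of the previous record
    last_pos = 0          # position of the previous record
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos_text = line.split("\t", 2)[:2]
        pos = int(pos_text)
        if chrom != current_chrom:
            if chrom in seen:
                # The chromosome reappears after another one: not contiguous.
                return False
            seen.add(chrom)
            current_chrom = chrom
        elif pos < last_pos:
            # Position went backwards within the same chromosome.
            return False
        last_pos = pos
    return True


# Made-up records: the second list has positions out of order.
good = ["#CHROM\tPOS\n", "1\t100\t.\tA\tG\n", "1\t200\t.\tC\tT\n", "2\t50\t.\tG\tA\n"]
bad = ["1\t200\t.\tA\tG\n", "1\t100\t.\tC\tT\n"]
print(vcf_is_sorted(good), vcf_is_sorted(bad))
```

For a real file, the same function can be fed an open file handle (for example from `gzip.open(path, "rt")` for a compressed VCF), since it only iterates over lines.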
Hi @jamie-m-a,
Thank you for your response! I appreciate the suggestion about checking the sorting of my input VCF. I have confirmed that the VCF file is sorted correctly.
Regarding your question, the input VCF contains 6,139,369 variants.
Thanks again for your help!
Best, Ananya
Thanks for the update, @Ananya-swi. The run time does seem slow. I'll try running some tests on a similarly sized input and get back to you.
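To put the numbers reported in this thread in perspective, here is a quick back-of-the-envelope calculation. It treats the "around 15 hours" figure as exactly 15 hours and assumes, purely for illustration, that work is spread evenly across the 32 forks; the function name `throughput` is hypothetical.

```python
from typing import Tuple


def throughput(n_variants: int, hours: float, forks: int) -> Tuple[float, float]:
    """Return (variants per second overall, variants per second per fork)."""
    seconds = hours * 3600
    overall = n_variants / seconds
    # Assumes an even split of work across forks, which is a simplification.
    return overall, overall / forks


# Figures from the thread: 6,139,369 variants, about 15 hours, --fork 32.
overall, per_fork = throughput(6_139_369, 15, 32)
print(f"{overall:.1f} variants/s overall, {per_fork:.2f} variants/s per fork")
# prints "113.7 variants/s overall, 3.55 variants/s per fork"
```

Having a per-second figure makes it easier to compare this run against test runs on other inputs of different sizes.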
Hi,
I am working on annotating large datasets, specifically Whole Genome Sequencing (WGS) VCF files, using the Variant Effect Predictor (VEP). However, the annotation process is taking significantly longer than expected. For example, annotating a 1.8GB VCF file took approximately 15 hours.
Environment Details:
I am seeking guidance on how to optimize VEP for faster annotation. Could you provide recommendations on:
Thank you for your support and insights.
Best regards, Ananya Saji