Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Optimize WGS VCF File Annotation for Improved Performance and Speed #1769


Ananya-swi commented 1 month ago

Hi,

I am working on annotating large datasets, specifically Whole Genome Sequencing (WGS) VCF files, using the Variant Effect Predictor (VEP). However, the annotation process is taking significantly longer than expected. For example, annotating a 1.8GB VCF file took approximately 15 hours.

Environment Details:

I am seeking guidance on how to optimize VEP for faster annotation. Could you provide recommendations on:

  1. Configuring the VM or container for better performance.
  2. Any VEP parameters or caching strategies that could improve processing times.
  3. Alternative VM sizes or architectures that might be better suited for WGS annotations.

Thank you for your support and insights.

Best regards, Ananya Saji

jamie-m-a commented 1 month ago

Hi @Ananya-swi thanks for reaching out to us.

It would be useful to know the full VEP command you are using, so we can try to identify potential speed-ups. However, even without that, one option I can suggest is our Nextflow VEP, which offers a degree of parallelisation to speed up processing of large datasets.

I should say that we haven't tested it on cloud compute yet - so if you do decide to try it, let us know if you encounter any challenges.
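
For illustration only (this is not the Nextflow pipeline itself), the same kind of region-level parallelism can be approximated by hand with bcftools and GNU parallel: split the input by chromosome and annotate the chunks independently. A rough sketch, in which the file names, chromosome list and concurrency values are placeholders, and the docker wrapper plus your usual annotation options would need to be carried over to each vep call:

```
# Illustration only: split the input by chromosome and annotate chunks in parallel.
# Region extraction needs a bgzipped, tabix-indexed input.
bgzip -c input.vcf > input.vcf.gz
tabix -p vcf input.vcf.gz

# One VEP process per chromosome, four chunks at a time (GNU parallel);
# chromosome names must match the VCF (placeholder list shown here).
parallel -j 4 '
  bcftools view -r {} input.vcf.gz -Oz -o chunk_{}.vcf.gz &&
  vep --offline --cache --fork 8 -i chunk_{}.vcf.gz -o chunk_{}.txt \
      --tab --force_overwrite
' ::: {1..22} X Y
```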

Ananya-swi commented 1 month ago

Hi @jamie-m-a,

Thank you for the recommendation! I’m sharing the full VEP command I used below for your reference:

```
docker run -i -v /data:/opt/vep/.vep ensemblorg/ensembl-vep:release_106.1 vep \
  --cache --refseq --CACHE_VERSION 106 --dir_plugins /opt/vep/.vep/Plugins --no_stats \
  -i input.vcf -o output.txt \
  --symbol --hgvs --hgvsg --variant_class --gene_phenotype --flag_pick_allele_gene \
  --canonical --appris --ccds --numbers --total_length --mane --sift p --polyphen p \
  --fasta /opt/vep/.vep/homo_sapiens_refseq/106_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz \
  --species homo_sapiens --assembly GRCh37 --af --af_gnomad --no_escape \
  --plugin NMD --dir_plugins /opt/vep/.vep/Plugins \
  --tab --offline --buffer_size 100000 --fork 32 --force_overwrite
```

I’ll also explore the Nextflow VEP option to see if it speeds up the annotation process. If any issues arise on the cloud platform, I’ll follow up accordingly.

Best regards, Ananya

jamie-m-a commented 1 month ago

Hi @Ananya-swi

No problem! Now that I can see your command, I notice you're not using forks, which can have a significant speed impact. Some general instructions for speeding up Ensembl VEP can be found here.

Let us know how you get on.
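
For reference, a hedged sketch of the performance-related options that are commonly benchmarked with a command like yours (the annotation flags are omitted here for brevity, file names are placeholders, and the numbers are starting points rather than recommendations):

```
# Sketch only: benchmark these values on your own data.

# bgzip-compress the input once; VEP reads compressed VCFs directly.
bgzip -c input.vcf > input.vcf.gz

# --fork: often set close to the number of physical cores rather than above it.
# --buffer_size: the default is 5000; very large buffers raise memory use, so the
#   100000 in the command above is worth comparing against smaller values.
# --compress_output bgzip: compresses the output, cutting I/O on large files.
docker run -i -v /data:/opt/vep/.vep ensemblorg/ensembl-vep:release_106.1 vep \
  --offline --cache --refseq --no_stats \
  --fork 16 --buffer_size 5000 --compress_output bgzip \
  -i input.vcf.gz -o output.txt.gz --tab --force_overwrite
```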

Ananya-swi commented 1 month ago

Hi @jamie-m-a,

Thank you for the feedback! I wanted to clarify that I did use the --fork option, setting it to 32. However, I still observed long runtimes, with the process taking around 15 hours for a 1.8GB input VCF.

It would be great if you could share any additional insights or optimization tips, particularly regarding other parameters that could improve performance. I’ll also explore the general recommendations provided in the link you shared.

Looking forward to hearing from you!

Thanks again, Ananya

jamie-m-a commented 1 month ago

Apologies @Ananya-swi - I missed the fork flag. The other easy thing to check is whether your input VCF is properly sorted. Your run time does seem long for a file that size. Can you advise how many variants are in your input?
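
In case it helps, both checks can be done with standard tools; a minimal sketch, with file names as placeholders:

```
# Coordinate-sort the VCF, write bgzipped output, and index it.
bcftools sort input.vcf -O z -o input.sorted.vcf.gz
tabix -p vcf input.sorted.vcf.gz

# Count variant records (every non-header line).
zgrep -vc '^#' input.sorted.vcf.gz
```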

Ananya-swi commented 1 month ago

Hi @jamie-m-a,

Thank you for your response! I appreciate the suggestion about checking the sorting of my input VCF. I have confirmed that the VCF file is sorted correctly.

Regarding your question, the input VCF contains 6,139,369 variants.
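
For context, that works out to roughly 6,139,369 / (15 × 3,600 s) ≈ 114 variants per second overall, or about 3.5 variants per second per fork at --fork 32.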

Thanks again for your help!

Best, Ananya

jamie-m-a commented 1 month ago

Thanks for the update @Ananya-swi. The run time does seem slow; I'll try running some tests on a similarly sized input and get back to you.