Hey @EngineerReversed, hope you are having a nice day!
I see that you are using the flag --everything. Enabling this flag may slow down VEP by 5 times, so I would recommend manually selecting the flags you really want to run instead. The slowest of the flags enabled by --everything are:
--af --af_1kg --af_esp --af_gnomad --max_af
--pubmed
--hgvs
--regulatory
To improve VEP runtime, please avoid using those flags.
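For instance, a trimmed-down invocation along these lines (the input/output paths and the particular flags chosen are purely illustrative, not a recommendation for your dataset) enables only the annotations you actually need:

```
# Illustrative only: enable just the annotations you need instead of --everything
./vep --cache --offline -i input.vcf -o output.txt --af --hgvs
```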
Also, if you are able to update to VEP 106.1, you can also look at our Nextflow pipeline that partitions VEP by chromosome: https://github.com/Ensembl/ensembl-vep/tree/release/106/nextflow
While looking at the Hail documentation, I saw that they have a method to run VEP that may be useful in your case: https://hail.is/docs/0.2/methods/genetics.html#hail.methods.vep
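A minimal sketch of calling that method (the matrix table and config paths below are placeholders; the VEP config file format is described in the Hail docs linked above):

```python
import hail as hl

hl.init()

# Placeholder paths: point these at your matrix table and VEP config JSON.
mt = hl.read_matrix_table('chr21.mt')
annotated = hl.vep(mt, config='file:///path/to/vep-config.json')
annotated.write('chr21.vep.mt', overwrite=True)
```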
Anyway, if you have suggestions or ideas on how we can improve VEP to work better on AWS and other cloud computing services, feel free to talk to us!
Hope this helps.
Kind regards, Nuno
Sorry, I was out for a week due to personal reasons, and thanks for the detailed reply. After doing extensive testing (burning 5000+ pounds), we found that our VEP annotations ran fine for 300-500 GiB of data, while they stayed in a pending state for 1+ TiB of data. Based on my analysis, it seems to be a problem on the DNAnexus side: their API request count is the limiting factor, and we have raised the issue with their support team.
Thanks! Hope you are having a good time.
Hey @EngineerReversed,
Thanks for the update and I hope you can get your problem solved as soon as possible.
I am going to close this issue now, but please do open new issues for any future queries.
All the best!
Cheers, Nuno
Hi,
Context: I am currently working on UKBB 450K WES data processing in the DNAnexus environment. My end goal is to generate a gene burden matrix table for the entire dataset. Since this can be costly and time-consuming, I have split the data by chromosome and am processing the chromosomes one by one.
Challenge: I have been trying to annotate one chromosome's Hail matrix table with VEP annotations, but it is taking forever. I looked deeper into the issue and found that the VEP annotation task initially uses all available cores, but later, at the collect step, it uses only one or two cores, leaving the other nodes idle and increasing the computational cost.
Approaches taken:
Earlier I thought it was because the partitions were unequal, so I tried repartitioning, but that didn't work. I also tried cutting down my VEP JSON schema to the bare minimum (in the hope that on-the-fly computation of frequencies might be taking the time), but that didn't work either; a sketch of the repartitioning attempt follows this paragraph. Here are the screenshots describing the same: VEP annotation for chromosomes - Album on Imgur
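For reference, the repartitioning attempt looked roughly like this (the paths and the partition count are placeholders, not the exact values we used):

```python
import hail as hl

# Placeholder path; the partition count below is illustrative.
mt = hl.read_matrix_table('chr1.mt')
print(mt.n_partitions())                  # inspect the current partitioning
mt = mt.repartition(2000, shuffle=True)   # spread rows evenly across partitions
mt = mt.checkpoint('chr1.repart.mt')      # materialize before running hl.vep
```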
Cluster specification:
No. of nodes: 7
Type of instance: mem2_ssd1_v2_x96
Cloud provider: DNAnexus on AWS
Size of input data: 382.9 GiB
Data point: We ended up annotating the chromosome, growing it from 382.9 GiB to 2969.2 GiB, at a cost of 413.6 pounds over a time span of 22 hours.
Questions:
What can I do to speed up my VEP annotation task?
How can I make it possible to process partitions in parallel?
If not, should I use a smaller cluster and let it run for 2-3 days as a more economical approach?
Are there any other ways of annotating via Hail?
PS: Our VEP cache data sits in the HDFS cluster.
System
VEP Json Schema