Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

VEP annotation taking forever with nodes sitting idle in DNAnexus #1199

Closed EngineerReversed closed 2 years ago

EngineerReversed commented 2 years ago

Hi,

Context: I am currently working on UKBB 450K WES data processing in the DNAnexus environment. My end goal is to generate a gene burden matrix table for the entire dataset. Since this can be costly and time-consuming, I have split the data by chromosome and am processing the chromosomes one by one.

Challenge: I have been trying to annotate the Hail matrix table for one of the chromosomes with VEP, but it is taking forever. Looking deeper into the issue, I found that the VEP annotation task initially uses all of the available cores, but at the later collect step it uses only one or two cores, so the other nodes sit idle while the computational cost keeps growing.

Approaches taken:

Earlier I thought it was because the partitions were uneven, so I tried repartitioning, but that did not help. I also tried cutting my VEP JSON schema down to the bare minimum (hoping that on-the-fly computation of frequencies might be what was taking the time), but that did not help either. Here are screenshots describing the same: VEP annotation for chromosomes - Album on Imgur
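
For reference, the repartitioning step described above looks roughly like this in Hail; this is a minimal sketch, and the matrix table path and partition count are placeholders rather than values from this thread.

import hail as hl

hl.init()

# Placeholder path: the per-chromosome MatrixTable to be annotated.
mt = hl.read_matrix_table('path/to/ukbb_450k_chr1.mt')

# Inspect how the data is currently split across partitions.
print(mt.n_partitions())

# Aim for at least a few partitions per core across the cluster so that
# executors do not sit idle; shuffle=True forces a full reshuffle, which
# also evens out skewed partitions. The target count here is illustrative.
mt = mt.repartition(2000, shuffle=True)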

Cluster specification:
No. of nodes: 7
Type of instance: mem2_ssd1_v2_x96
Cloud provider: DNAnexus on AWS
Size of input data: 382.9 GiB

Data point: We ended up annotating the chromosome, growing it from 382.9 GiB of input to 2969.2 GiB of annotated output, at a cost of 413.6 pounds over a span of 22 hours.

Questions:

What can I do to speed up my VEP annotation task?
How can I make it possible to process the partitions in parallel?
If that is not possible, should I use a smaller cluster and let it run for 2-3 days as a more economical approach?
Are there any other ways of annotating via Hail?

PS: Our VEP cache data sits in an HDFS cluster.

System

VEP JSON schema

{
  "command": [
    "docker", "run", "-i", "-v", "/cluster/vep:/root/.vep", "dnanexus/dxjupyterlab-vep",
     "./vep", "--format", "vcf", "__OUTPUT_FORMAT_FLAG__", "--everything", "--allele_number",
     "--no_stats", "--cache", "--offline", "--fork", "16", "--minimal", "--assembly", "GRCh38", "-o", "STDOUT",
     "--check_existing", "--dir_cache", "/root/.vep/",
     "--fasta", "/root/.vep/homo_sapiens/103_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
    "--plugin", "LoF,loftee_path:/root/.vep/Plugins/loftee,human_ancestor_fa:/root/.vep/human_ancestor.fa,conservation_file:/root/.vep/loftee.sql,gerp_bigwig:/root/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw"
  ],
    "env": {"PERL5LIB": "/cluster/vep:/cluster"},
    "vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele:String,amr_maf:Float64,clin_sig:Array[String],start:Int32,strand:Int32,end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele: String,exac_fin_maf: Float64,exac_maf: Float64,exac_nfe_allele: String,exac_nfe_maf:Float64,exac_oth_allele: String,exac_oth_maf: Float64,exac_sas_allele: String,exac_sas_maf: Float64,id:String,minor_allele: String,minor_allele_freq: Float64,phenotype_or_disease: Int32,pubmed: Array[Int32],frequencies: Dict[String,Struct{sas: Float64,afr: Float64,gnomad_nfe: Float64,gnomad: Float64,gnomad_fin: Float64,gnomad_eas: Float64,gnomad_afr: Float64,amr: Float64,gnomad_oth: Float64,ea: Float64,eur: Float64,gnomad_asj: Float64,eas: Float64,gnomad_amr: Float64,gnomad_sas: Float64,aa: Float64}],sas_allele: String,sas_maf: Float64,somatic: Int32}],context: String,end: Int32,id: String,input: String,intergenic_consequences: Array[Struct{allele_num: Int32, consequence_terms: Array[String], impact: String, minimised: Int32, variant_allele: String}],most_severe_consequence: String,motif_feature_consequences: Array[Struct{allele_num: Int32, consequence_terms: Array[String], high_inf_pos: String, impact: String, minimised: Int32, motif_feature_id: String, motif_name: String, motif_pos: Int32, motif_score_change: Float64, strand: Int32, variant_allele: String}],regulatory_feature_consequences: Array[Struct{allele_num: Int32, biotype: String,consequence_terms: Array[String], impact: String, minimised: Int32, regulatory_feature_id: String, variant_allele: String}],seq_region_name: String,start: Int32,strand: Int32,transcript_consequences: Array[Struct{allele_num: Int32, amino_acids: String,appris: String, biotype: String, canonical:Int32, ccds: String, cdna_start: Int32,cdna_end: Int32, cds_end: Int32, cds_start:Int32, codons: String, consequence_terms: Array[String], distance: Int32, domains: Array[Struct{db: String, name: String}], exon: String, gene_id: String, gene_pheno: Int32, gene_symbol: String, gene_symbol_source: String, hgnc_id: String, hgvsc: String, hgvsp: String, hgvs_offset: Int32, impact: String, intron: String, lof: String, lof_flags: String, lof_filter: String, lof_info: String, minimised: Int32, polyphen_prediction: String, polyphen_score: Float64, protein_end: Int32, protein_start: Int32, protein_id: String, sift_prediction: String, sift_score: Float64, strand: Int32, swissprot: String, transcript_id: String, trembl: String, tsl: Int32, uniparc: String, variant_allele: String}],variant_class: String}"
}
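
For context, a config file like the one above is what Hail's hl.vep consumes; below is a minimal usage sketch, with placeholder paths that are not taken from this thread.

import hail as hl

# Placeholder paths: the per-chromosome MatrixTable and the JSON config shown above.
mt = hl.read_matrix_table('path/to/ukbb_450k_chr1.mt')
mt = hl.vep(mt, 'file:///path/to/vep_config.json')

# Quick sanity check: peek at the most severe consequence per variant.
ht = mt.rows()
ht.select(most_severe=ht.vep.most_severe_consequence).show(5)
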
nuno-agostinho commented 2 years ago

Hey @EngineerReversed, hope you are having a nice day!

I see that you are using the --everything flag. Enabling this flag may slow VEP down by a factor of about 5, so I would recommend manually selecting only the flags you really need instead. The slowest of the flags enabled by --everything are:

To improve VEP runtime, please avoid using those flags.
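
As a concrete illustration of dropping --everything in favour of an explicit flag list, here is a hedged Python sketch that rewrites the Hail VEP config shown earlier in this thread; the flags kept (--symbol, --canonical, --biotype, --af) are only an example of a small set, and the right selection depends on what the burden analysis actually needs.

import json

# Placeholder filenames for the existing and trimmed Hail VEP configs.
with open('vep_config.json') as f:
    config = json.load(f)

# Drop --everything and keep a small, explicit set of annotations instead.
command = [arg for arg in config['command'] if arg != '--everything']
command += ['--symbol', '--canonical', '--biotype', '--af']
config['command'] = command

with open('vep_config_trimmed.json', 'w') as f:
    json.dump(config, f, indent=2)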

Also, if you are able to update to VEP 106.1, you can take a look at our Nextflow pipeline, which partitions VEP runs by chromosome: https://github.com/Ensembl/ensembl-vep/tree/release/106/nextflow

While looking at the Hail documentation, I saw that they have a method to run VEP that may be useful in your case: https://hail.is/docs/0.2/methods/genetics.html#hail.methods.vep
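
One common pattern with hl.vep, sketched below under assumptions rather than taken from the thread, is to annotate only the distinct variant sites as a keyed Table and then join the annotations back onto the MatrixTable, so the VEP work scales with the number of variants rather than with the full per-sample data; block_size is a tunable of hl.vep.

import hail as hl

# Placeholder paths for the input MatrixTable and the VEP config.
mt = hl.read_matrix_table('path/to/ukbb_450k_chr1.mt')

# Annotate just the sites (keyed by locus and alleles), not the full MatrixTable.
sites = mt.rows().select()
sites = hl.vep(sites, 'file:///path/to/vep_config.json', block_size=500)

# Join the VEP annotations back onto the MatrixTable by row key.
mt = mt.annotate_rows(vep=sites[mt.row_key].vep)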

Anyway, if you have suggestions or ideas on how we can improve VEP to work better in AWS and other cloud computing services, feel free to talk with us!

Hope this helps.

Kind regards, Nuno

EngineerReversed commented 2 years ago

Sorry, I was out for a week due to personal reasons; thanks for the detailed reply. After doing extensive testing (and burning 5000+ pounds), we found that our VEP annotations ran fine for 300-500 GiB of data but stayed in a pending state for 1+ TiB of data. Based on my analysis, this appears to be a DNAnexus problem: their DNAnexus API call count is the limiting factor, and we have raised the issue with their support team.

Regarding the --everything flag, the Hail usage suggestion, and the offer to discuss how VEP could work better in AWS and other cloud computing services: thanks! Hope you are having a good time.

nuno-agostinho commented 2 years ago

Hey @EngineerReversed,

Thanks for the update and I hope you can get your problem solved as soon as possible.

I am going to close this issue now, but please do open a new issue for any future queries.

All the best!

Cheers, Nuno