Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

VEP in Google Batch fails when more than 5 custom databases are passed #1623

Closed MatteoSchiavinato closed 7 months ago

MatteoSchiavinato commented 7 months ago

Describe the issue

I am running VEP with 9 custom databases and 9 plugins within a pipeline that has so far only been run locally. When running it in Google Cloud (Batch), the VEP command fails silently.

The command reads the custom databases and plugin data from a bucket that Nextflow mounts automatically via gcsfuse.
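
For reference, the Google Batch side of the setup is roughly of this shape in nextflow.config (project, region, and bucket names below are placeholders, not the real values):

    // Sketch of a Nextflow Google Batch setup; Nextflow mounts the
    // gs:// work directory into each task VM via gcsfuse.
    process.executor = 'google-batch'
    workDir          = 'gs://my-bucket/work'  // placeholder bucket
    google.project   = 'my-gcp-project'       // placeholder project
    google.location  = 'us-central1'          // placeholder region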

When I reduce the number of custom objects (plugins or databases) to 5 or fewer, the command succeeds. It does not matter which 5 custom databases I pass; any 5 will succeed, but the moment I pass 6 it fails.
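
To illustrate the bisection, below is the kind of test I ran, as a sketch with placeholder track specs rather than the real files:

    #!/usr/bin/env bash
    # Re-run VEP with an increasing number of --custom tracks to find
    # the threshold at which it starts failing.
    TRACKS=(
        "database/esp6500siv2.vcf.gz,esp6500siv2,vcf,exact,0,MAF"
        "database/avsnp152.vcf.gz,avsnp152,vcf,exact,0,RS"
        "database/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG"
        # ... remaining tracks ...
    )
    for n in $(seq 1 "${#TRACKS[@]}"); do
        args=()
        for t in "${TRACKS[@]:0:$n}"; do args+=( --custom "$t" ); done
        echo "Trying $n custom track(s)"
        vep --offline --cache -i input.vcf -o "out_${n}.vcf" --vcf "${args[@]}" \
            || { echo "First failure with $n custom tracks"; break; }
    done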

Is there some sort of interaction between VEP and GCP that limits the number of custom resources that can be used at the same time?

System

VEP version: 107
VEP Cache version: 107
Perl version: v5.32.1
OS: Ubuntu
tabix installed: yes

Full VEP command line

        ${VEP} \
        --offline --cache --force_overwrite --show_ref_allele \
        --numbers \
        --fork ${task.cpus} \
        --refseq \
        --cache_version ${params.database.vep_cache_version} \
        --dir_cache database/${params.database.cache_VEP} \
        --fasta database/${params.database.fasta_VEP_gz} \
        --dir_plugins ${VEP_PLUGINS} \
        --assembly \${ASSEMBLY} \
        --custom database/${params.database.ESP6500SI},esp6500siv2,vcf,exact,0,MAF \
        --custom database/${params.database.avsnp152},avsnp152,vcf,exact,0,RS \
        --custom database/${params.database.genomicSuperDups_name},${params.genome_version}_genomicSuperDups_name,bed,overlap,0 \
        --custom database/${params.database.genomicSuperDups_fracMatch},${params.genome_version}_genomicSuperDups_fracMatch,bed,overlap,0 \
        --custom database/${params.database.genomicSuperDups_fracMatchIndel},${params.genome_version}_genomicSuperDups_fracMatchIndel,bed,overlap,0 \
        --custom database/${params.database.clinvar},ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN,CLNDISDB,CLNALLELEID \
        --custom database/${params.database.ExAC},exac03,vcf,exact,0,AF,AC,AC_Het,AC_Hom,AC_AFR,AC_AMR,AC_EAS,AC_FIN,AC_NFE,AC_OTH,AC_SAS,AN_AFR,AN_AMR,AN_EAS,AN_FIN,AN_NFE,AN_OTH,AN_SAS,Het_AFR,Het_AMR,Het_EAS,Het_FIN,Het_NFE,Het_OTH,Het_SAS,Hom_AFR,Hom_AMR,Hom_EAS,Hom_FIN,Hom_NFE,Hom_OTH,Hom_SAS \
        --custom database/${params.database.gnomad_exomes},gnomad_exomes,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth,AF_sas,AF_XY,AF_XX,popmax \
        --custom database/${params.database.gnomad_genomes},gnomad_genomes,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth,AF_sas,AF_XY,AF_XX,popmax \
        --plugin LOVD \
        --plugin dbscSNV,database/${params.database.dbscSNV} \
        --plugin dbNSFP,database/${params.database.dbNSFP},transcript_match=1,ALL \
        --plugin satMutMPRA,file=database/${params.database.satMutMPRA},cols=ALL \
        --plugin GeneSplicer,database/${params.database.GeneSplicer}/sources/genesplicer,database/${params.database.GeneSplicer}/human \
        --plugin MaxEntScan,database/${params.database.MaxEntScan},NCSS \
        --plugin SpliceAI,snv=database/${params.database.SpliceAI_snp},indel=database/${params.database.SpliceAI_indel} \
        --plugin pLI,database/${params.database.ExACpLI} \
        --plugin NearestExonJB \
        --af_1kg \
        --hgvs --hgvsg --symbol --nearest symbol --distance 1000 \
        --canonical --exclude_predicted \
        --regulatory \
        -i ${norm_vcf} \
        -o ${sample_id}_VEP_output.raw.vcf \
        -vcf --flag_pick_allele_gene \
        --pick_order mane,canonical,appris,tsl,biotype,ccds,rank,length

Full error message

There is no error message.

Data files (if applicable)

They include:

olaaustine commented 7 months ago

Hi @MatteoSchiavinato, I hope this finds you well. Thank you for your query. If possible, can you share the compute resources used to configure this job on Batch? I think this might be an issue of insufficient compute resources to run the job. Hoping to hear from you soon. Thank you, Ola.

MatteoSchiavinato commented 7 months ago

It's a Google Batch call, which uses this group of settings:

            withLabel: vep {
                machineType = 'n2-highcpu-96'
                cpus = 96
                maxForks = 1
                disk = "500 GB"
            }
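
For context: n2-highcpu machine types provide 1 GB of RAM per vCPU, so this instance has roughly 96 GB of memory in total, shared across all 96 VEP forks.
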
MatteoSchiavinato commented 7 months ago

Does VEP make any use of /dev/shm?

olaaustine commented 7 months ago

Hi @MatteoSchiavinato, We suspect this is a memory issue. We suggest you use the --buffer_size option, maybe --buffer_size 500, and also use --verbose so more information is printed out. Secondly, for the GeneSplicer plugin, can you use the tmpdir option? If this does not help, can you try a smaller input file to see whether the issue still persists? If it does not, can you use the memory directive to determine how much memory the process is allowed to use? Thank you, Ola.
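
A sketch of those suggestions applied to the command above (paths are placeholders; tmpdir is a documented option of the GeneSplicer plugin):

    # Smaller buffer plus verbose logging; GeneSplicer is pointed at a
    # local temporary directory rather than the fuse-mounted bucket.
    vep \
        --offline --cache --dir_cache database/cache \
        --buffer_size 500 \
        --verbose \
        --plugin GeneSplicer,/path/to/genesplicer,/path/to/human,tmpdir=/tmp \
        -i input.vcf -o output.vcf --vcf

And on the Nextflow side, the memory directive can declare how much memory the task is allowed to use (the value below is illustrative):

    withLabel: vep {
        machineType = 'n2-highcpu-96'
        cpus = 96
        memory = '90 GB'   // illustrative value, not from the original config
        maxForks = 1
        disk = "500 GB"
    }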

MatteoSchiavinato commented 7 months ago

I'm testing the smaller buffer size as you suggested, and the command has been running steadily for 30 minutes. Considering that the same command runs fine locally in 45 minutes using 30 CPUs, and that I'm using 96 in the cloud, can I assume that reducing the buffer by a factor of 10 also increases computation time by a factor of 10?

Also, is this reduction from 5000 to 500 something that has been tested (i.e. I should not try to increase the buffer), or would you, in your experience, suggest spending some time fine-tuning the --buffer_size parameter to our cloud setup?

olaaustine commented 7 months ago

Hi @MatteoSchiavinato, The --buffer_size option sets the number of variants that are read into memory simultaneously. Set it lower to use less memory at the expense of a longer run time, and higher to use more memory with a faster run time. So in response to your question: yes, reducing the buffer size increases run time, but it also means less memory and disk are used. I would also suggest spending some time fine-tuning it to your cloud setup, especially working within the memory and disk allocation given to this task. Thank you, Ola.
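
One way to fine-tune it is to sweep --buffer_size on a representative input and record the peak memory of each run; a sketch (placeholder paths, with /usr/bin/time -v being GNU time on Linux):

    # Try increasing buffer sizes and report each run's peak resident
    # set size, to find the largest value that fits the memory given
    # to the task.
    for bs in 500 1000 2000 5000; do
        /usr/bin/time -v vep \
            --offline --cache --dir_cache database/cache \
            --buffer_size "$bs" \
            -i input.vcf -o "out_bs${bs}.vcf" --vcf \
            2> "time_bs${bs}.log"
        grep 'Maximum resident set size' "time_bs${bs}.log"
    done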