broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

JointDiscovery Workflow Errors due to Java Heap Space? #6165

Closed vymao closed 4 years ago

vymao commented 5 years ago

Bug Report

I am running the JointDiscovery workflow from the GATK Best Practices pipeline on about 150 VCF files called by HaplotypeCaller, and I get this error:

19:01:58.009 WARN  VariantDataManager - WARNING: Very large training set detected. Downsampling to 2500000 training variants.
19:04:18.918 INFO  VariantRecalibrator - Shutting down engine
[September 16, 2019 7:04:18 PM EDT] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 912.93 minutes.
Runtime.totalMemory()=3204972544
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.broadinstitute.hellbender.tools.walkers.vqsr.MultivariateGaussian.<init>(MultivariateGaussian.java:31)
    at org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel.<init>(GaussianMixtureModel.java:34)
    at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:43)
    at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.onTraversalSuccess(VariantRecalibrator.java:625)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:895)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

I believe this is what causes the workflow failure reported in the Cromwell log, since the stderr it points to shows the same Java heap space error:

[2019-09-16 19:05:59,50] [error] WorkflowManagerActor Workflow 9f7a01a4-0632-4817-8622-aa51e520abf1 failed (during ExecutingWorkflowState): Job JointGenotyping.SNPsVariantRecalibratorClassic:NA:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: /path/to/stderr.

I have read a past discussion (https://gatkforums.broadinstitute.org/gatk/discussion/23880/java-heap-space) that suggests this may be a bug. It pointed me to increasing the available heap memory with -Xmx on the main command. Is this the way to do it?

'java -Xmx600G -Dconfig.file=' + re.sub('input.json', 'overrides.conf', input_json) + ' -jar ' + args.cromwell_path + ' run ' + re.sub('input.json', 'joint-discovery-gatk4.wdl', input_json) + ' -i ' + input_json

where I substitute in the corresponding config, json, and wdl files.

Is 600G enough? Each VCF is around 6G, and since I have 150 of them, does that mean I should allocate more than 900G (6G x 150)?

ldgauthier commented 5 years ago

Why are you running VariantRecalibrator on multiple files? In the current implementation the tool reads all the variants into memory, so merging the files beforehand would dramatically reduce the memory requirements.

kaboroevich commented 4 years ago

I believe your issue is that you are assigning 600GB to the Cromwell process itself, while the error comes from the VariantRecalibrator call inside one of the tasks, which runs in its own JVM and does not have enough memory. Several tasks call VariantRecalibrator; do you know which one failed? Can you post the java call from the STDERR file? For me, it was the task SNPsVariantRecalibrator, which was assigned only 3.5GB of memory by default.
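As a rough check on that reading, the Runtime.totalMemory() value printed in the original log can be converted to GiB. This is only a sketch: totalMemory() reports the heap currently committed to the JVM, which is not necessarily the -Xmx limit, and the function name below is made up for illustration.

```python
# Value copied from the "Runtime.totalMemory()=..." line in the log above.
total_memory_bytes = 3204972544


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


heap_gib = bytes_to_gib(total_memory_bytes)
print(f"Heap reported by the failing JVM: {heap_gib:.2f} GiB")
# prints "Heap reported by the failing JVM: 2.98 GiB"
```

A heap of roughly 3 GiB is in line with a task-level default of a few GB and nowhere near the 600G passed to Cromwell.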

In joint-discovery-gatk4.wdl, the memory assigned to each task can be set via "machine_mem_gb", but the current input.json does not seem to have that variable; it has "mem_size" for each task instead.
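One way to see which memory-related inputs your copy of input.json actually exposes is to list the keys that mention memory. The sketch below uses a small stand-in dictionary; the key names and values in it are hypothetical and will differ from the real file.

```python
import json


def memory_keys(inputs: dict) -> list:
    """Return the input keys whose names mention memory, sorted."""
    return sorted(key for key in inputs if "mem" in key.lower())


# Hypothetical stand-in for the contents of input.json (illustrative keys and values only).
example_inputs = json.loads("""
{
  "JointGenotyping.callset_name": "example",
  "JointGenotyping.SNPsVariantRecalibratorClassic.mem_size": "example value"
}
""")

for key in memory_keys(example_inputs):
    print(key, "=", example_inputs[key])
```

For the real file, replace the stand-in with json.load(open(path_to_input_json)), where the path is wherever your inputs file lives.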

A simple solution would be to replace ${java_mem} with a static value in calls to VariantRecalibrator (lines 564 & 684). For example, replace:

${gatk_path} --java-options "-Xmx${java_mem}g -Xms${java_mem}g"

with

${gatk_path} --java-options "-Xmx100g -Xms100g"

I'm not certain this will help, but I think it's a step in the right direction.
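If you would rather script the substitution than edit the WDL by hand, something like the following could work. It is a minimal sketch, not a tested patch for the real workflow: the function name is made up, the sample text is a tiny stand-in for the WDL, and the line numbers you pass in have to match your copy of the file.

```python
# The java options string as it appears in the WDL, per the example above.
OLD = '--java-options "-Xmx${java_mem}g -Xms${java_mem}g"'


def pin_java_heap(wdl_text: str, line_numbers, heap_gb: int) -> str:
    """Replace the ${java_mem}-based heap options with a fixed size on the given 1-based lines."""
    new = f'--java-options "-Xmx{heap_gb}g -Xms{heap_gb}g"'
    lines = wdl_text.splitlines(keepends=True)
    for n in line_numbers:
        if OLD not in lines[n - 1]:
            # Refuse to guess if the file does not look the way we expect.
            raise ValueError(f"line {n} does not contain the expected java options")
        lines[n - 1] = lines[n - 1].replace(OLD, new)
    return "".join(lines)


# Tiny stand-in for the WDL; the real file would be patched at the lines mentioned above.
sample = (
    "command {\n"
    '  ${gatk_path} --java-options "-Xmx${java_mem}g -Xms${java_mem}g" \\\n'
    "    VariantRecalibrator\n"
    "}\n"
)
patched = pin_java_heap(sample, [2], 100)
print(patched)
```

Patching by explicit line number keeps the change limited to the VariantRecalibrator calls, and the ValueError guards against a WDL version where those lines have moved.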