broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GATK4 BaseRecalibratorSpark, Executor heartbeat timed out after X ms #4515

Open Tintest opened 6 years ago

Tintest commented 6 years ago

Hello, I'm trying to run GATK4 BaseRecalibratorSpark locally in a Nextflow pipeline, but I get the same error every time with this command:

gatk-launch BaseRecalibratorSpark --spark-runner LOCAL --spark-master local[23] -R $indexbit -I ${input_bam} --known-sites ${params.dbsnpAll} -L ${params.targetBed} -O ${output_name}-recalibration_report.grp

Here is the full error output. Basically, the failure seems to be linked to BaseRecalibratorSpark being inactive for roughly two minutes (126542 ms in the log), which kills the Spark heartbeat and then the process:

    18/03/09 09:22:08 WARN Executor: Issue communicating with driver in heartbeater
    java.lang.NullPointerException
            at org.apache.spark.storage.memory.MemoryStore.getSize(MemoryStore.scala:127)
            at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:387)
            at org.apache.spark.storage.BlockManager$$anonfun$reportAllBlocks$3.apply(BlockManager.scala:218)
            at org.apache.spark.storage.BlockManager$$anonfun$reportAllBlocks$3.apply(BlockManager.scala:217)
            at scala.collection.Iterator$class.foreach(Iterator.scala:893)
            at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
            at org.apache.spark.storage.BlockManager.reportAllBlocks(BlockManager.scala:217)
            at org.apache.spark.storage.BlockManager.reregister(BlockManager.scala:236)
            at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:522)
            at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
            at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
            at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
            at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
            at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    18/03/09 09:22:08 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    18/03/09 09:22:08 INFO SparkContext: Successfully stopped SparkContext
    09:22:08.389 INFO  BaseRecalibratorSpark - Shutting down engine
    [March 9, 2018 9:22:08 AM UTC] org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark done. Elapsed time: 61.53 minutes.
    Runtime.totalMemory()=16815489024
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 0.0 failed 1 times, most recent failure: Lost task 8.0 in stage 0.0 (TID 8, localhost): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 126542 ms
    Driver stacktrace:
            at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
            at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
            at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
            at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
            at scala.Option.foreach(Option.scala:257)
            at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
            at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
            at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1936)
            at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1002)
            at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
            at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
            at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
            at org.apache.spark.rdd.RDD.reduce(RDD.scala:984)
            at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1127)
            at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
            at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
            at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
            at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1104)
            at org.apache.spark.api.java.JavaRDDLike$class.treeAggregate(JavaRDDLike.scala:438)
            at org.apache.spark.api.java.AbstractJavaRDDLike.treeAggregate(JavaRDDLike.scala:45)
            at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.apply(BaseRecalibratorSparkFn.java:39)
            at org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark.runTool(BaseRecalibratorSpark.java:159)
            at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:387)
            at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:159)
            at org.broadinstitute.hellbender.Main.mainEntry(Main.java:201)
            at org.broadinstitute.hellbender.Main.main(Main.java:287)
    18/03/09 09:22:08 INFO ShutdownHookManager: Shutdown hook called
    18/03/09 09:22:08 INFO ShutdownHookManager: Deleting directory /tmp/qtestard/spark-af28085b-4317-4d0f-85dd-a966b1d26d9d
[command.txt](https://github.com/broadinstitute/gatk/files/1796441/command.txt)

I get the same error when I run it outside of Nextflow. I also tried running it with --conf spark.executor.heartbeatInterval=120, but it seems to have no effect; I'm not sure that is the right syntax for a local Spark execution.
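For reference, here is a minimal sketch of how the standard Spark timeout properties could be passed on a local run. The property names spark.executor.heartbeatInterval and spark.network.timeout are stock Spark settings; the values are only illustrative, and whether this gatk-launch build forwards --conf properties to the local Spark context is an assumption.

    # Sketch only: raise the heartbeat interval and the overall network timeout.
    # The heartbeat interval must stay well below spark.network.timeout, and a
    # bare "120" may be parsed as milliseconds, so give explicit units.
    gatk-launch BaseRecalibratorSpark \
        --spark-runner LOCAL --spark-master local[23] \
        -R $indexbit -I ${input_bam} \
        --known-sites ${params.dbsnpAll} -L ${params.targetBed} \
        -O ${output_name}-recalibration_report.grp \
        --conf spark.executor.heartbeatInterval=60s \
        --conf spark.network.timeout=600s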

The complete log file is attached: command.txt

Can you help me?

Also posted on the GATK forum

Best regards.

lbergelson commented 6 years ago

@Tintest Sorry for the slow reply. I'm not sure exactly what the issue is. I've never seen this exact error before.

I have two guesses. One is that something really weird is going on in Spark that's causing that null pointer exception, which is killing the heartbeat. I'm not sure how to debug that without access to your input data.

The other theory, which I think is more likely, is that you're running out of memory and that's causing weird errors to occur.

How much memory is available to your job? You can set the Java -Xmx value with --java-options "-Xmx120g" as a GATK option. I would check that you're not running out of memory on your machine, or giving the job too little. I think for BaseRecalibratorSpark you want at least 2-4 GB per core, but I haven't tested it in a long time, so I might be wrong about that.
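As a rough illustration of that suggestion (not a verified configuration), the same local command with an explicit heap size might look like the sketch below. With the LOCAL runner everything runs in one JVM, so this single -Xmx value bounds the memory for all 23 worker threads; 48g is just an example of roughly 2 GB per core.

    # Sketch: give the single local JVM an explicit heap, leaving headroom for the OS.
    gatk-launch BaseRecalibratorSpark \
        --java-options "-Xmx48g" \
        --spark-runner LOCAL --spark-master local[23] \
        -R $indexbit -I ${input_bam} \
        --known-sites ${params.dbsnpAll} -L ${params.targetBed} \
        -O ${output_name}-recalibration_report.grp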

Tintest commented 6 years ago

Thank you @lbergelson for your answer.

The memory I'm using is specified by Nextflow, but I could also force it at the Java level.

My machine has 64 GB of memory and 24 cores. 2 GB per core is OK, but more could be problematic. Maybe I will try on a more powerful machine and let you know how it goes.

Does the Spark strategy really need more memory? I've seen memory usage peak around 60 GB before the crash with the Spark tool, and around 25 GB with the "classic" version (which completes the run without any issue).

Thank you.

oldmikeyang commented 5 years ago

I got the same problem while initializing the reference. The total memory is only 991428608 bytes (about 945 MB)?

org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark done. Elapsed time: 0.13 minutes.
Runtime.totalMemory()=991428608
18/09/15 17:21:28 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.broadinstitute.hellbender.relocated.com.google.common.io.ByteStreams.copy(ByteStreams.java:74)
    at org.broadinstitute.hellbender.relocated.com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:115)
    at org.broadinstitute.hellbender.engine.spark.datasources.ReferenceTwoBitSource.<init>(ReferenceTwoBitSource.java:40)
    at org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource.<init>(ReferenceMultiSource.java:42)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReference(GATKSparkTool.java:500)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:468)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:459)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:188)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:207)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
18/09/15 17:21:28 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.OutOfMemoryError: Java heap space)
18/09/15 17:21:28 INFO util.ShutdownHookManager: Shutdown hook called
18/09/15 17:21:28 INFO util.ShutdownHookManager: Deleting directory /tmp/hadoop-test/nm-local-dir/usercache/test/appcache/application_1537046459822_0001/spark-d0ea2344-815c-4bbf-a11b-35031a9c4940
oldmikeyang commented 5 years ago

After increasing both the driver and executor memory to 12 GB, the problem went away.
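For anyone hitting the same YARN-side OOM, a sketch of what that change could look like is below. spark.driver.memory and spark.executor.memory are standard Spark properties; the placeholder paths and the --spark-runner/--spark-master arguments are assumptions based on a typical GATK Spark cluster launch, not a verified setup.

    # Sketch: raise both driver and executor memory for a YARN-backed run.
    gatk BaseRecalibratorSpark \
        -R ${reference} -I ${input_bam} \
        --known-sites ${known_sites_vcf} \
        -O ${sample}-recalibration_report.grp \
        --spark-runner SPARK --spark-master yarn \
        --conf spark.driver.memory=12g \
        --conf spark.executor.memory=12g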