broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Java core dump in ApplyBQSR Spark due to GKL deflater #3605

Closed chapmanb closed 6 years ago

chapmanb commented 7 years ago

I'm running into a consistent core dump in GATK 4 beta 5 (GKL 0.5.8) related to deflation with the Intel Genomics Kernel Library. This occurs on an AWS m4.4xlarge machine running Ubuntu 16.04 and produces this stack trace:

https://gist.github.com/chapmanb/006c1c9abeb21e9baf244d17d7ae1003

Running ApplyBQSR:

unset JAVA_HOME && export PATH=/mnt/work/bcbio/anaconda/bin:$PATH && gatk-launch --javaOptions '-Xms1000m -Xmx46965m -XX:+UseSerialGC -Djava.io.tmpdir=/mnt/work/cwl/bcbio_validation_workflows/somatic-giab-mix/bunny_work/main-somatic-giab-mix-2017-09-23-094842.494/root/postprocess_alignment/2/bcbiotx/tmpLCoup3' ApplyBQSRSpark --sparkMaster local[16] --input /mnt/work/cwl/bcbio_validation_workflows/somatic-giab-mix/bunny_work/main-somatic-giab-mix-2017-09-22-201054.451/root/alignment/2/merge_split_alignments/align/giab-mix-tumor/giab-mix-tumor-sort.bam --output /mnt/work/cwl/bcbio_validation_workflows/somatic-giab-mix/bunny_work/main-somatic-giab-mix-2017-09-23-094842.494/root/postprocess_alignment/2/bcbiotx/tmpLCoup3/giab-mix-tumor-sort-recal.bam --bqsr_recal_file /mnt/work/cwl/bcbio_validation_workflows/somatic-giab-mix/bunny_work/main-somatic-giab-mix-2017-09-23-094842.494/root/postprocess_alignment/2/align/giab-mix-tumor/giab-mix-tumor-sort-recal.grp

Adding --use_jdk_deflater to the ApplyBQSR command line avoids the issue.
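As a rough sketch of that workaround, the command can be wrapped in a small dry-run helper. The variables `BAM_IN`, `BAM_OUT`, `BQSR_RECAL` and `TEMP_DIR`, the `run` helper, and the `DRY_RUN` switch are placeholders introduced for illustration, not part of the original report; only the GATK arguments themselves come from the command above.

```shell
# Placeholder inputs (hypothetical names); substitute the real paths.
BAM_IN="${BAM_IN:-input.bam}"
BAM_OUT="${BAM_OUT:-output-recal.bam}"
BQSR_RECAL="${BQSR_RECAL:-input-recal.grp}"
TEMP_DIR="${TEMP_DIR:-/tmp}"

# Hypothetical helper: with DRY_RUN=1 (the default here) print each argument
# on its own line instead of executing the command.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$@"
  else
    "$@"
  fi
}

# Same ApplyBQSRSpark invocation as above, with --use_jdk_deflater appended
# so the JDK deflater is used instead of the GKL one.
apply_bqsr_jdk_deflater() {
  run gatk-launch \
    --javaOptions "-Xms1000m -Xmx46965m -XX:+UseSerialGC -Djava.io.tmpdir=${TEMP_DIR}" \
    ApplyBQSRSpark \
    --sparkMaster 'local[16]' \
    --input "$BAM_IN" \
    --output "$BAM_OUT" \
    --bqsr_recal_file "$BQSR_RECAL" \
    --use_jdk_deflater
}

apply_bqsr_jdk_deflater
```

Setting `DRY_RUN=0` would execute the command rather than print it. Quoting `'local[16]'` keeps the shell from treating the brackets as a glob pattern.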

I'm not sure if the Java stack dump and command line provide enough information to be useful or if a reproducible case is needed. The case above reproduces but uses fairly large BAM files, and I haven't been able to reduce it to a more minimal case. I could prepare and share it if that would be helpful. Thanks much for looking at this.

droazen commented 7 years ago

@gspowley @erniebrau @pnvaidya Could one of you please have a look at this issue?

gspowley commented 7 years ago

@chapmanb We were able to reproduce a failure with your command line. This looks like an issue related to JNI and garbage collection that is exposed by setting -Xmx46965m and -XX:+UseSerialGC, but it needs further debugging.

To confirm, can you please try running without specifying these javaOptions? Something like this:

./gatk-launch --javaOptions "-Djava.io.tmpdir=$TEMP_DIR" \
  ApplyBQSRSpark \
  --sparkMaster local[16] \
  --input $BAM_IN \
  --output $BAM_OUT \
  --bqsr_recal_file $BQSR_RECAL \
  -- \
  --conf spark.local.dir=$SPARK_LOCAL_DIR

FYI, we see better performance from Spark when using an SSD for spark.local.dir. The --conf option above shows how to set the spark.local.dir.

chapmanb commented 7 years ago

George -- thanks much for debugging and identifying the underlying problem. I can confirm that we're able to avoid the error by removing -XX:+UseSerialGC and moving back to parallel GC. We'd initially introduced the serial GC usage to avoid problems when running multiple HaplotypeCaller commands simultaneously on a single machine, but by letting the Spark implementation take care of parallelizing, we should no longer need to worry about that. Thanks again for the workaround and the tip on using spark.local.dir. Much appreciated.
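For pipelines that assemble the JVM options as a single string, removing the serial GC flag could look roughly like the sketch below. The function name `strip_serial_gc` is hypothetical, and the sketch assumes the options are separated by spaces and contain no embedded whitespace or glob characters.

```shell
# Hypothetical helper: drop -XX:+UseSerialGC from a space-separated JVM
# option string and keep every other option in its original order.
# Assumes no option contains whitespace or glob characters.
strip_serial_gc() {
  out=""
  for opt in $1; do
    [ "$opt" = "-XX:+UseSerialGC" ] && continue
    out="${out:+$out }$opt"
  done
  printf '%s\n' "$out"
}

# Example with the heap settings from the original command.
strip_serial_gc "-Xms1000m -Xmx46965m -XX:+UseSerialGC -Djava.io.tmpdir=/tmp"
```

The resulting string can then be passed to `--javaOptions` in place of the original one.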

gspowley commented 6 years ago

Thanks for the feedback, Brad. We'll continue to look into the core dump to make sure it doesn't cause issues in the future.

gspowley commented 6 years ago

This issue is related to https://github.com/Intel-HLS/GKL/issues/81

vdauwera commented 6 years ago

I'm assuming that the recent GKL update addresses this, so am closing based on the girl scout principle (find it broken? fix it), but feel free to reopen.

https://github.com/broadinstitute/gatk/pull/3865