broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

HaplotypeCaller native SmithWaterman: core dumps and JVM errors #6733

Open nh13 opened 4 years ago

nh13 commented 4 years ago

Hey folks,

I have a test dataset that, interestingly, core-dumps or throws JVM errors with --smith-waterman FASTEST_AVAILABLE but not with --smith-waterman JAVA. The only thing I can think of is that Intel's native implementation has some length limitation, since I am passing --assembly-region-padding 1000 to GATK to call 100-1000bp indels (and it works!). Unfortunately, I cannot share the test BAM. What can I do to help debug further?

I'm using gatk4-4.1.8.1-0 from conda create -n debug-gatk4 -c defaults -c conda-forge -c bioconda gatk4.

$gatk ... -version
The Genome Analysis Toolkit (GATK) v4.1.8.1
HTSJDK Version: 2.23.0
Picard Version: 2.22.8
$ java -version
openjdk version "1.8.0_152-release"
OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12)
OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)

First error motif:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000010efa9dc2, pid=23946, tid=0x000000000000a503
#
# JRE version: OpenJDK Runtime Environment (8.0_152-b12) (build 1.8.0_152-release-1056-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.152-b12 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.dylib+0x3a9dc2]  PhaseIdealLoop::set_ctrl(Node*, Node*)+0x10
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
#
# Compiler replay data is saved as:
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp

Second error motif:

java(24057,0x7000035bd000) malloc: Incorrect checksum for freed object 0x7fd8a8193600: probably modified after being freed.
Corrupt value: 0x2e4630002e47e
java(24057,0x7000035bd000) malloc: *** set a breakpoint in malloc_error_break to debug
droazen commented 4 years ago

@nh13 As a first step I'd suggest filing a ticket against the GKL (https://github.com/Intel-HLS/GKL) so that Intel engineers can have a look (but leave this GATK ticket open so that we can track it here as well).

This will be difficult to debug without a test case that Intel can run to reproduce the issue on their end. Does the crash only occur with this one particular sample, or have you seen it on more than one sample? If you could get to the point where you can reproduce it on a shareable bam snippet, that would obviously maximize the chances of this getting fixed.

Intel is currently (at our request) doing a pass on the GKL with valgrind to find and fix memory safety issues (https://github.com/Intel-HLS/GKL/issues/107), so we expect the next GKL release to fix a bunch of "use after free"-type errors. Maybe they'll get lucky and fix this one as well. Timeline for the release is within the next ~2-3 months.

After that, we've asked them to test the GKL with long-read data, which is also known to trigger crashes like this (https://github.com/Intel-HLS/GKL/issues/105). If the problem in your case is that you've exceeded some hardcoded length limitation, the tests on long-read data might reveal it.

nh13 commented 4 years ago

Let me try to synthesize a BAM and see if that works. It fails on many different samples (~1000+).

droazen commented 4 years ago

@nh13 Great, that would help a lot. Could you also post your complete HaplotypeCaller command line?

nh13 commented 4 years ago

Fails with command:

gatk \
    --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g" \
    --spark-runner LOCAL HaplotypeCaller \
    -ERC GVCF \
    -I test.bam \
    -O out.bam \
    -R ucsc.hg19.fasta \
    --assembly-region-padding 1000 \
    --smith-waterman FASTEST_AVAILABLE \
    -L "chr1:26644000-26646000";

Changing --smith-waterman FASTEST_AVAILABLE to --smith-waterman JAVA works just fine.

test.sam ``` @HD VN:1.4 SO:coordinate @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @SQ SN:chr1_gl000191_random LN:106433 @SQ SN:chr1_gl000192_random LN:547496 @SQ SN:chr4_ctg9_hap1 LN:590426 @SQ SN:chr4_gl000193_random LN:189789 @SQ SN:chr4_gl000194_random LN:191469 @SQ SN:chr6_apd_hap1 LN:4622290 @SQ SN:chr6_cox_hap2 LN:4795371 @SQ SN:chr6_dbb_hap3 LN:4610396 @SQ SN:chr6_mann_hap4 LN:4683263 @SQ SN:chr6_mcf_hap5 LN:4833398 @SQ SN:chr6_qbl_hap6 LN:4611984 @SQ SN:chr6_ssto_hap7 LN:4928567 @SQ SN:chr7_gl000195_random LN:182896 @SQ SN:chr8_gl000196_random LN:38914 @SQ SN:chr8_gl000197_random LN:37175 @SQ SN:chr9_gl000198_random LN:90085 @SQ SN:chr9_gl000199_random LN:169874 @SQ SN:chr9_gl000200_random LN:187035 @SQ SN:chr9_gl000201_random LN:36148 @SQ SN:chr11_gl000202_random LN:40103 @SQ SN:chr17_ctg5_hap1 LN:1680828 @SQ SN:chr17_gl000203_random LN:37498 @SQ SN:chr17_gl000204_random LN:81310 @SQ SN:chr17_gl000205_random LN:174588 @SQ SN:chr17_gl000206_random LN:41001 @SQ SN:chr18_gl000207_random LN:4262 @SQ SN:chr19_gl000208_random LN:92689 @SQ SN:chr19_gl000209_random LN:159169 @SQ SN:chr21_gl000210_random LN:27682 @SQ SN:chrUn_gl000211 LN:166566 @SQ SN:chrUn_gl000212 LN:186858 @SQ SN:chrUn_gl000213 LN:164239 @SQ SN:chrUn_gl000214 LN:137718 @SQ SN:chrUn_gl000215 LN:172545 @SQ SN:chrUn_gl000216 LN:172294 @SQ SN:chrUn_gl000217 LN:172149 @SQ SN:chrUn_gl000218 LN:161147 @SQ SN:chrUn_gl000219 
LN:179198 @SQ SN:chrUn_gl000220 LN:161802 @SQ SN:chrUn_gl000221 LN:155397 @SQ SN:chrUn_gl000222 LN:186861 @SQ SN:chrUn_gl000223 LN:180455 @SQ SN:chrUn_gl000224 LN:179693 @SQ SN:chrUn_gl000225 LN:211173 @SQ SN:chrUn_gl000226 LN:15008 @SQ SN:chrUn_gl000227 LN:128374 @SQ SN:chrUn_gl000228 LN:129120 @SQ SN:chrUn_gl000229 LN:19913 @SQ SN:chrUn_gl000230 LN:43691 @SQ SN:chrUn_gl000231 LN:27386 @SQ SN:chrUn_gl000232 LN:40652 @SQ SN:chrUn_gl000233 LN:45941 @SQ SN:chrUn_gl000234 LN:40531 @SQ SN:chrUn_gl000235 LN:34474 @SQ SN:chrUn_gl000236 LN:41934 @SQ SN:chrUn_gl000237 LN:45867 @SQ SN:chrUn_gl000238 LN:39939 @SQ SN:chrUn_gl000239 LN:33824 @SQ SN:chrUn_gl000240 LN:41933 @SQ SN:chrUn_gl000241 LN:42152 @SQ SN:chrUn_gl000242 LN:43523 @SQ SN:chrUn_gl000243 LN:43341 @SQ SN:chrUn_gl000244 LN:39929 @SQ SN:chrUn_gl000245 LN:36651 @SQ SN:chrUn_gl000246 LN:38154 @SQ SN:chrUn_gl000247 LN:36422 @SQ SN:chrUn_gl000248 LN:39786 @SQ SN:chrUn_gl000249 LN:38502 @RG ID:test-sample SM:test-sample PL:ILLUMINA A00226:391:HKFF3DRXX:1:2218:20229:31485 16 chr1 26644464 60 70M18D65M16S * 0 0 CACCTAAAAAGCTGCTACCAAGACAGCCACGAAGATCCTACCAAAATGAAGCGCTTCCTCTTCCTCCTACCTATGGTACAGGTAAGAGTAACGCCTGGCACCACTGCCAGGACTCCCCAAGGTTGCTTGGCATGGGAAGGGACTACTTCAA FFF:FFFF,FFF,FFFFFFF,:F::FFFFFFFFFFFFF,FFFF:::FFFFFFFF,FFFFF,FFFFFF,,FFFFFF::FFFFFFFFFFFFFFFFFF:FFFFF:FF,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF MD:Z:70^TCACCATCAGCCTCCTGG0T17C31A14 RG:Z:test-sample NM:i:21 AS:i:96 XS:i:20 A00226:391:HKFF3DRXX:1:2218:20419:31532 16 chr1 26644464 60 70M18D65M16S * 0 0 CACCTAAAAAGCTGCTACCAAGACAGCCACGAAGATCCTACCAAAATGAAGCGCTTCCTCTTCCTCCGACCTATGGTACAGGTAAGAGTAACGCCTGGCACCACTGCCAGGACTCCCCAAGGTTGCTTGGCATGGGAAGGGACTACTTCAA FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFF,FFFFF:FFFFFF,:FFF,FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF MD:Z:67T2^TCACCATCAGCCTCCTGG0T17C31A14 RG:Z:test-sample NM:i:22 AS:i:91 XS:i:0 A00226:391:HKFF3DRXX:1:2246:13404:22294 16 chr1 26644472 60 62M18D87M2S * 0 0 
AAGCTGCTACCAAGACAGCCACGAAGATCCTACCAAAATGAAGCGCTTCCTCTTCCTCCTACCTATGGTACAGGTAAGAGCAACGCCTGGCACCACTGCCAGGACTCCCCAAAGTTGCTTGGCATGGAGGGAGGGCATACAGGATGTGACT FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,FFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF MD:Z:62^TCACCATCAGCCTCCTGG0T86 RG:Z:test-sample NM:i:19 AS:i:120 XS:i:20 ```
nh13 commented 4 years ago

@droazen I posted the complete command line I used (the version is above). I posted a test.sam that reproducibly fails on my machine (OSX). And below is the log from my machine:

22:42:22.298 INFO  NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/nhomer/miniconda3/envs/bfx/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
Aug 01, 2020 10:42:22 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
22:42:22.412 INFO  HaplotypeCaller - ------------------------------------------------------------
22:42:22.412 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.1.8.1
22:42:22.412 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
22:42:22.412 INFO  HaplotypeCaller - Executing as nhomer@ip-192-168-7-102.ec2.internal on Mac OS X v10.14.6 x86_64
22:42:22.412 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_192-b01
22:42:22.412 INFO  HaplotypeCaller - Start Date/Time: August 1, 2020 10:42:22 PM MST
22:42:22.412 INFO  HaplotypeCaller - ------------------------------------------------------------
22:42:22.412 INFO  HaplotypeCaller - ------------------------------------------------------------
22:42:22.413 INFO  HaplotypeCaller - HTSJDK Version: 2.23.0
22:42:22.413 INFO  HaplotypeCaller - Picard Version: 2.22.8
22:42:22.413 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
22:42:22.413 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
22:42:22.413 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
22:42:22.413 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
22:42:22.413 INFO  HaplotypeCaller - Deflater: IntelDeflater
22:42:22.413 INFO  HaplotypeCaller - Inflater: IntelInflater
22:42:22.413 INFO  HaplotypeCaller - GCS max retries/reopens: 20
22:42:22.413 INFO  HaplotypeCaller - Requester pays: disabled
22:42:22.413 INFO  HaplotypeCaller - Initializing engine
22:42:22.705 INFO  IntervalArgumentCollection - Processing 2001 bp from intervals
22:42:22.710 INFO  HaplotypeCaller - Done initializing engine
22:42:22.712 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
22:42:22.719 INFO  NativeLibraryLoader - Loading libgkl_utils.dylib from jar:file:/Users/nhomer/miniconda3/envs/bfx/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_utils.dylib
22:42:22.720 INFO  NativeLibraryLoader - Loading libgkl_smithwaterman.dylib from jar:file:/Users/nhomer/miniconda3/envs/bfx/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.dylib
22:42:22.722 INFO  SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
22:42:22.724 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
22:42:22.724 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
22:42:22.734 WARN  NativeLibraryLoader - Unable to find native library: native/libgkl_pairhmm_omp.dylib
22:42:22.734 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
22:42:22.734 INFO  NativeLibraryLoader - Loading libgkl_pairhmm.dylib from jar:file:/Users/nhomer/miniconda3/envs/bfx/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm.dylib
22:42:22.748 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
22:42:22.748 WARN  IntelPairHmm - Ignoring request for 4 threads; not using OpenMP implementation
22:42:22.748 INFO  PairHMM - Using the AVX-accelerated native PairHMM implementation
22:42:22.751 WARN  GATKVariantContextUtils - Can't determine output variant file format from output file extension "bam". Defaulting to VCF.
22:42:22.776 INFO  ProgressMeter - Starting traversal
22:42:22.777 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000010f47efd3, pid=96919, tid=0x0000000000002303
#
# JRE version: OpenJDK Runtime Environment (8.0_192-b01) (build 1.8.0_192-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.192-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libgkl_smithwaterman4496658849792952100.dylib+0x1fd3]  _Z22smithWatermanBackTrackP10dnaSeqPairiiiiPii+0x3c3
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /private/tmp/hs_err_pid96919.log
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
rpomaris commented 4 years ago

@nh13 Thank you for documenting this. Could you please open a ticket over under https://github.com/Intel-HLS/GKL/issues? We're working on a new release and tracking issues over there.

nh13 commented 4 years ago

@mepowers already done (see linked issue above): https://github.com/Intel-HLS/GKL/issues/113

Kmannth commented 4 years ago

@nh13 Do you see this both on Mac and Linux or just Mac?

nh13 commented 4 years ago

~I have not tried it on a Linux machine, does it work on yours?~ It was on a linux machine

rpomaris commented 4 years ago

@nh13 - @Kmannth and I met today and discussed this ticket. As @droazen mentioned above, we're currently working on a new release, primarily bringing dependencies up to date and addressing a couple of reported bugs related to processing SNPs and indels. The current GKL release was primarily tested on SNPs and indels, not on long reads. We have a few open tickets surrounding long reads and plan to address them in a future release; we will pull your ticket into that body of work.

In the meantime please let us know if you run into GKL issues with short reads, and we will be sure to prioritize for our pending release.

nh13 commented 4 years ago

@mepowers To clarify, this is not a "long read", but rather a short read with a long indel. For cases like this, I'd expect that if the Intel native implementation fails, GATK should fall back to the Java one automatically. Or can you detect what's "too long" and use the Java version instead?
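The fallback suggested here could look something like the following minimal sketch. All class and interface names are hypothetical illustrations, not GATK's actual API; the 1024 limit is the `MAX_SEQ_LEN` value quoted later in this thread.

```java
// Hypothetical sketch, not GATK's actual classes: wrap the native
// Smith-Waterman aligner with a length guard and fall back to the
// pure-Java implementation when a sequence exceeds the native limit.
public final class GuardedSmithWaterman {

    /** Minimal aligner interface, for illustration only. */
    public interface Aligner {
        int align(byte[] ref, byte[] alt);
    }

    // Mirrors MAX_SEQ_LEN from the GKL's smithwaterman_common.h (per this thread).
    static final int MAX_NATIVE_SEQ_LEN = 1024;

    private final Aligner nativeAligner;
    private final Aligner javaAligner;

    public GuardedSmithWaterman(Aligner nativeAligner, Aligner javaAligner) {
        this.nativeAligner = nativeAligner;
        this.javaAligner = javaAligner;
    }

    public int align(byte[] ref, byte[] alt) {
        // Route oversized inputs to the slower but safe Java implementation.
        if (ref.length > MAX_NATIVE_SEQ_LEN || alt.length > MAX_NATIVE_SEQ_LEN) {
            return javaAligner.align(ref, alt);
        }
        return nativeAligner.align(ref, alt);
    }
}
```

The guard is cheap (two length comparisons per alignment), so it would not measurably slow the fast path.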

nh13 commented 4 years ago

See the CIGAR: 70M18D65M16S. It's a 151bp read, which is standard for Illumina.

rpomaris commented 4 years ago

@nh13 Thank you for clarifying. As of today, the GKL does not auto-detect when a read is "too long," i.e. a read length we haven't validated with the GKL. We should be able to build that into our pending release. I agree we should also make sure that if the GKL pairHMM fails, the Java version is called instead. @Kmannth @droazen let's discuss this in our next sync.

nh13 commented 4 years ago

Thanks for considering the request and let me know if I can help.

Kmannth commented 4 years ago

@nh13 It was decided today that we would work to make the GKL defensive against this issue as a first step.

Right now `smithwaterman_common.h` has `#define MAX_SEQ_LEN 1024`. I think this may be part of the issue, though `-Xmx4g` does not seem like a lot of space.

    --assembly-region-padding 1000 \
    -L "chr1:26644000-26646000";

@droazen Is the sequence size being attempted "1000x2 + 2000 = 4000", or is it more like "1000x2 + 151 = 2151"?

droazen commented 4 years ago

@Kmannth With the --assembly-region-padding 1000 argument alone, each region will be a maximum of 300 bases, with 1000 bases of padding on either side. So, 2300 bases maximum for the padded assembly region size. The reads themselves appear to be 151 bases in this case.
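To make that arithmetic concrete, here is a throwaway sketch (the 300 bp default maximum region size and the 1024 bp `MAX_SEQ_LEN` are both taken from comments in this thread, not verified against the source):

```java
// Sketch only: check how the padded assembly region size compares to
// the GKL's native Smith-Waterman buffer limit.
public final class PaddedRegionSize {
    public static void main(String[] args) {
        final int maxAssemblyRegionSize = 300;   // GATK default, per droazen
        final int assemblyRegionPadding = 1000;  // from the failing command line
        final int gklMaxSeqLen = 1024;           // MAX_SEQ_LEN in smithwaterman_common.h

        // Padding is applied on both sides of the assembly region.
        final int paddedSize = maxAssemblyRegionSize + 2 * assemblyRegionPadding;

        System.out.println("max padded region: " + paddedSize + " bp");            // 2300 bp
        System.out.println("exceeds GKL buffer: " + (paddedSize > gklMaxSeqLen));  // true
    }
}
```

At 2300 bp, the padded region is more than double the 1024 bp buffer, which is consistent with the crash occurring in the native Smith-Waterman backtrack code.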

nh13 commented 4 years ago

@droazen I think we'll want to set --max-assembly-region-size to something large too. Does this "max" apply before or after padding?

pkolakow commented 3 years ago

This issue has been fixed here https://github.com/Intel-HLS/GKL/pull/142. The patch for this bug will come out with the next release.