Open nh13 opened 4 years ago
@nh13 As a first step I'd suggest filing a ticket against the GKL (https://github.com/Intel-HLS/GKL) so that Intel engineers can have a look (but leave this GATK ticket open so that we can track it here as well).
This will be difficult to debug without a test case that Intel can run to reproduce the issue on their end. Does the crash only occur with this one particular sample, or have you seen it on more than one sample? If you could get to the point where you can reproduce it on a shareable bam snippet, that would obviously maximize the chances of this getting fixed.
Intel is currently (at our request) doing a pass on the GKL with valgrind to find and fix memory safety issues (https://github.com/Intel-HLS/GKL/issues/107), so we expect the next GKL release to fix a bunch of "use after free"-type errors. Maybe they'll get lucky and fix this one as well. Timeline for the release is within the next ~2-3 months.
After that we've asked them to test the GKL with long reads data, which is also known to trigger crashes like this (https://github.com/Intel-HLS/GKL/issues/105). If the problem in your case is that you've exceeded some hardcoded length limitation, the tests on long reads data might reveal the problem.
Let me try to synthesize a BAM and see if that works. It fails on many different samples (~1000+).
@nh13 Great, that would help a lot. Could you also post your complete HaplotypeCaller command line?
Fails with command:
gatk \
--java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g" \
--spark-runner LOCAL HaplotypeCaller \
-ERC GVCF \
-I test.bam \
-O out.bam \
-R ucsc.hg19.fasta \
--assembly-region-padding 1000 \
--smith-waterman FASTEST_AVAILABLE \
-L "chr1:26644000-26646000";
Changing --smith-waterman FASTEST_AVAILABLE to --smith-waterman JAVA works just fine.
@droazen I posted the complete command line I used (the version is above). I posted a test.sam that reproducibly fails on my machine (OSX). And below is the log from my machine:
22:42:22.298 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/nhomer/miniconda3/envs/bfx/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
Aug 01, 2020 10:42:22 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
22:42:22.412 INFO HaplotypeCaller - ------------------------------------------------------------
22:42:22.412 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.1.8.1
22:42:22.412 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
22:42:22.412 INFO HaplotypeCaller - Executing as nhomer@ip-192-168-7-102.ec2.internal on Mac OS X v10.14.6 x86_64
22:42:22.412 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_192-b01
22:42:22.412 INFO HaplotypeCaller - Start Date/Time: August 1, 2020 10:42:22 PM MST
22:42:22.412 INFO HaplotypeCaller - ------------------------------------------------------------
22:42:22.412 INFO HaplotypeCaller - ------------------------------------------------------------
22:42:22.413 INFO HaplotypeCaller - HTSJDK Version: 2.23.0
22:42:22.413 INFO HaplotypeCaller - Picard Version: 2.22.8
22:42:22.413 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
22:42:22.413 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
22:42:22.413 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
22:42:22.413 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
22:42:22.413 INFO HaplotypeCaller - Deflater: IntelDeflater
22:42:22.413 INFO HaplotypeCaller - Inflater: IntelInflater
22:42:22.413 INFO HaplotypeCaller - GCS max retries/reopens: 20
22:42:22.413 INFO HaplotypeCaller - Requester pays: disabled
22:42:22.413 INFO HaplotypeCaller - Initializing engine
22:42:22.705 INFO IntervalArgumentCollection - Processing 2001 bp from intervals
22:42:22.710 INFO HaplotypeCaller - Done initializing engine
22:42:22.712 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
22:42:22.719 INFO NativeLibraryLoader - Loading libgkl_utils.dylib from jar:file:/Users/nhomer/miniconda3/envs/bfx/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_utils.dylib
22:42:22.720 INFO NativeLibraryLoader - Loading libgkl_smithwaterman.dylib from jar:file:/Users/nhomer/miniconda3/envs/bfx/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.dylib
22:42:22.722 INFO SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
22:42:22.724 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
22:42:22.724 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
22:42:22.734 WARN NativeLibraryLoader - Unable to find native library: native/libgkl_pairhmm_omp.dylib
22:42:22.734 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
22:42:22.734 INFO NativeLibraryLoader - Loading libgkl_pairhmm.dylib from jar:file:/Users/nhomer/miniconda3/envs/bfx/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm.dylib
22:42:22.748 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
22:42:22.748 WARN IntelPairHmm - Ignoring request for 4 threads; not using OpenMP implementation
22:42:22.748 INFO PairHMM - Using the AVX-accelerated native PairHMM implementation
22:42:22.751 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "bam". Defaulting to VCF.
22:42:22.776 INFO ProgressMeter - Starting traversal
22:42:22.777 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x000000010f47efd3, pid=96919, tid=0x0000000000002303
#
# JRE version: OpenJDK Runtime Environment (8.0_192-b01) (build 1.8.0_192-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.192-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C [libgkl_smithwaterman4496658849792952100.dylib+0x1fd3] _Z22smithWatermanBackTrackP10dnaSeqPairiiiiPii+0x3c3
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /private/tmp/hs_err_pid96919.log
#
# If you would like to submit a bug report, please visit:
# http://www.azulsystems.com/support/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
@nh13 Thank you for documenting this. Could you please open a ticket at https://github.com/Intel-HLS/GKL/issues? We're working on a new release and tracking issues over there.
@mepowers already done (see linked issue above): https://github.com/Intel-HLS/GKL/issues/113
@nh13 Do you see this both on Mac and Linux or just Mac?
~I have not tried it on a Linux machine, does it work on yours?~ It was on a Linux machine.
@nh13 - @Kmannth and I met today and discussed this ticket. As @droazen mentioned above, we're currently working on a new release, primarily bringing dependencies up to date and addressing a couple of reported bugs related to processing SNPs and indels. The current release of GKL was primarily tested on SNPs and indels, not on long reads. We have a few open tickets surrounding long reads, which we plan to address in a future release. We will pull your ticket into that body of work.
In the meantime please let us know if you run into GKL issues with short reads, and we will be sure to prioritize for our pending release.
@mepowers to clarify, this is not a “long read”, but rather a short read with a long indel. For cases like this, I’d expect that if the Intel HMM fails, GATK should fall back to the Java one automatically. Or can you detect what’s “too long” and use the Java version?
See the CIGAR: 70M18D65M16S. It’s a 151bp read, which is standard for Illumina.
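For anyone checking the numbers, here is a minimal Python sketch (not part of GATK or GKL; the function name is made up for illustration) that sums the CIGAR operations according to which ones consume read bases and which consume reference bases in the SAM specification. For 70M18D65M16S it gives a 151-base read spanning 153 reference bases.

```python
import re

# CIGAR operations that consume read (query) bases and reference bases,
# as defined in the SAM specification.
QUERY_OPS = set("MIS=X")
REF_OPS = set("MDN=X")


def cigar_lengths(cigar):
    """Return (read_length, reference_span) for a CIGAR string."""
    read_len = 0
    ref_span = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in QUERY_OPS:
            read_len += n
        if op in REF_OPS:
            ref_span += n
    return read_len, ref_span


print(cigar_lengths("70M18D65M16S"))  # (151, 153)
```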
@nh13 Thank you for clarifying. As of today GKL does not auto-detect when there's a read that's "too long," i.e., a read length we haven't validated with GKL. We should be able to build that into our pending release. I agree we should also make sure that if the GKL pairHMM fails, the Java version is called instead. @Kmannth @droazen let's discuss this in our next sync.
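One practical point about the fallback idea: the crash in the log above is a SIGSEGV inside native code, which takes the whole JVM down, so it cannot be caught and retried after the fact. Any fallback would have to be decided before the native call is made. Below is a rough Python sketch of such a pre-check. Every name in it is hypothetical, and the limit passed in is a placeholder, not a GKL constant or GATK behavior.

```python
def choose_aligner(ref_len, alt_len, max_native_len, native_available=True):
    """Pick an implementation name before any native call is made.

    All names here are hypothetical; max_native_len stands in for whatever
    length the native library has actually been validated for.
    """
    if native_available and max(ref_len, alt_len) <= max_native_len:
        return "native"
    # Either the native library is missing or an input is longer than the
    # validated limit, so use the pure-Java path instead.
    return "java"


# Illustrative limit of 512 only; not taken from GKL.
print(choose_aligner(151, 151, 512))   # native
print(choose_aligner(2300, 151, 512))  # java
```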
Thanks for considering the request and let me know if I can help.
@nh13 It was decided today that we would work to make GKL defensive against this issue as a first step.
Right now, smithwaterman_common.h has #define MAX_SEQ_LEN 1024. I think this may be part of the issue, but -Xmx4g does not seem like a lot of space.
--assembly-region-padding 1000 \
-L "chr1:26644000-26646000";
@droazen Is the sequence size being attempted "1000x2 + 2000 = 4000", or is it more like "1000x2 + 151 = 2151"?
@Kmannth With the --assembly-region-padding 1000 argument alone, each region will be a maximum of 300 bases, with 1000 bases of padding on either side. So, 2300 bases maximum for the padded assembly region size. The reads themselves appear to be 151 bases in this case.
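The arithmetic above, written out as a small Python sketch (the helper name is made up for illustration). It only compares the padded size against the MAX_SEQ_LEN 1024 value quoted earlier in this thread; it does not establish that the sequences handed to the native Smith-Waterman code actually reach this size, or that the constant is the cause of the crash.

```python
def padded_region_size(max_region_size, padding):
    """Region size plus padding on both sides."""
    return max_region_size + 2 * padding


# Value quoted from smithwaterman_common.h earlier in this thread.
MAX_SEQ_LEN = 1024

# 300-base maximum region with --assembly-region-padding 1000.
size = padded_region_size(300, 1000)
print(size, size > MAX_SEQ_LEN)  # 2300 True
```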
@droazen I think we'll want to set --max-assembly-region-size to something large too. Does this "max" apply before or after padding?
This issue has been fixed here https://github.com/Intel-HLS/GKL/pull/142. The patch for this bug will come out with the next release.
Hey folks,
I have a test dataset that interestingly core-dumps or JVM errors with --smith-waterman FASTEST_AVAILABLE but not with --smith-waterman JAVA. The only thing I can think of is somehow Intel's HMM has a length limitation, as I am using --assembly-region-padding 1000 to GATK to call 100-1000bp indels (and it works!). I cannot share the test BAM unfortunately. What can I do to help debug further?
I'm using gatk4-4.1.8.1-0 from conda create -n debug-gatk4 -c defaults -c conda-forge -c bioconda gatk4.
First error motif:
Second error motif: