broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 592 forks source link

SIGSEGV when running gatk/4.6.0.0 #8988

Open emwjacobson opened 1 month ago

emwjacobson commented 1 month ago

* Opening on behalf of a user on an HPC cluster, my knowledge in this field is a bit limited.

Affected tool(s) or class(es)

gatk HaplotypeCaller

Affected version(s)

Latest 4.6.0.0 release

Description

When running command, ~16 hours into the run the program crashes. Below is the start of the Java error report file

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f06ed243291, pid=1058615, tid=1058616
#
# JRE version: OpenJDK Runtime Environment (17.0.2+8) (build 17.0.2+8-86)
# Java VM: OpenJDK 64-Bit Server VM (17.0.2+8-86, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libc.so.6+0xcf291]  __memset_avx2_erms+0x11
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /bigdata/ramadugulab/luy/SNPcallingBreeding/core.1058615)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  S U M M A R Y ------------

Command Line: -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 /bigdata/operations/pkgadmin/opt/linux/centos/8.x/x86_64/pkgs/gatk/4.6.0.0/gatk-package-4.6.0.0-local.jar HaplotypeCaller -R /rhome/luy/bigdata/genomes/Cclementina_182_v1_2.fa -I AlignedCalToCcl_Scaffolds_MarkDupOut.bam -O AlignedCalToCcl_Scaffolds.vcf.gz -ERC GVCF

Host: Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, 64 cores, 20G, Rocky Linux release 8.8 (Green Obsidian)
Time: Sat Sep 28 04:11:19 2024 PDT elapsed time: 58592.788414 seconds (0d 16h 16m 32s)

---------------  T H R E A D  ---------------

Current thread (0x00007f06e4025b70):  JavaThread "main" [_thread_in_native, id=1058616, stack(0x00007f06edc7a000,0x00007f06edd7b000)]

Stack: [0x00007f06edc7a000,0x00007f06edd7b000],  sp=0x00007f06edbe6458,  free space=18014398509481393k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libc.so.6+0xcf291]  __memset_avx2_erms+0x11
C  [libgkl_pairhmm_omp5311772482084658743.so+0x1500f]  Java_com_intel_gkl_pairhmm_IntelPairHmm_computeLikelihoodsNative._omp_fn.0+0xcf

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 8942  com.intel.gkl.pairhmm.IntelPairHmm.computeLikelihoodsNative([Ljava/lang/Object;[Ljava/lang/Object;[D)V (0 bytes) @ 0x00007f06d563401c [0x00007f06d5633fa0+0x000000000000007c]
J 10003 c2 com.intel.gkl.pairhmm.IntelPairHmm.computeLikelihoods([Lorg/broadinstitute/gatk/nativebindings/pairhmm/ReadDataHolder;[Lorg/broadinstitute/gatk/nativebindings/pairhmm/HaplotypeDataHolder;[D)V (119 bytes) @ 0x00007f06d5bff3e0 [0x00007f06d5bff3a0+0x0000000000000040]
J 6781 c2 org.broadinstitute.hellbender.utils.pairhmm.VectorLoglessPairHMM.computeLog10Likelihoods(Lorg/broadinstitute/hellbender/utils/genotyper/LikelihoodMatrix;Ljava/util/List;Lorg/broadinstitute/hellbender/utils/pairhmm/PairHMMInputScoreImputator;)V (450 bytes) @ 0x00007f06d54f8cc8 [0x00007f06d54f8a00+0x00000000000002c8]
J 10022 c2 org.broadinstitute.hellbender.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeReadLikelihoods(Lorg/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyResultSet;Lorg/broadinstitute/hellbender/utils/genotyper/SampleList;Ljava/util/Map;Z)Lorg/broadinstitute/hellbender/utils/genotyper/AlleleLikelihoods; (25 bytes) @ 0x00007f06d5c0cb30 [0x00007f06d5c0b540+0x00000000000015f0]
J 9971 c2 org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(Lorg/broadinstitute/hellbender/engine/AssemblyRegion;Lorg/broadinstitute/hellbender/engine/FeatureContext;Lorg/broadinstitute/hellbender/engine/ReferenceContext;)Ljava/util/List; (2286 bytes) @ 0x00007f06d5bdef08 [0x00007f06d5bdcd60+0x00000000000021a8]
J 10571% c2 org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(Lorg/broadinstitute/hellbender/engine/MultiIntervalLocalReadShard;Lorg/broadinstitute/hellbender/engine/ReferenceDataSource;Lorg/broadinstitute/hellbender/engine/FeatureManager;)V (154 bytes) @ 0x00007f06d5c8e5c0 [0x00007f06d5c8dd20+0x00000000000008a0]
j  org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse()V+83
j  org.broadinstitute.hellbender.engine.GATKTool.doWork()Ljava/lang/Object;+19
j  org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool()Ljava/lang/Object;+34
j  org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs()Ljava/lang/Object;+225
j  org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain([Ljava/lang/String;)Ljava/lang/Object;+14
j  org.broadinstitute.hellbender.Main.runCommandLineProgram(Lorg/broadinstitute/hellbender/cmdline/CommandLineProgram;[Ljava/lang/String;)Ljava/lang/Object;+20
j  org.broadinstitute.hellbender.Main.mainEntry([Ljava/lang/String;)V+22
j  org.broadinstitute.hellbender.Main.main([Ljava/lang/String;)V+8
v  ~StubRoutines::call_stub

siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x00007f06edc39d00

Register to memory mapping:

RAX=0x0 is NULL
RBX=0x00007f06edc39d00: <offset 0x0000000000006d00> in /bigdata/operations/pkgadmin/opt/linux/centos/8.x/x86_64/pkgs/java/17.0.2/lib/libjava.so at 0x00007f06edc33000
RCX=0x0000000000028318 is an unknown value
RDX=0x00007f06edc39d00: <offset 0x0000000000006d00> in /bigdata/operations/pkgadmin/opt/linux/centos/8.x/x86_64/pkgs/java/17.0.2/lib/libjava.so at 0x00007f06edc33000
RSP=0x00007f06edbe6458 points into unknown readable memory: 0x00007f0673c89bc4 | c4 9b c8 73 06 7f 00 00
RBP=0x00007f06edd78f50 is pointing into the stack for thread: 0x00007f06e4025b70
RSI=0x0 is NULL
RDI=0x00007f06edc39d00: <offset 0x0000000000006d00> in /bigdata/operations/pkgadmin/opt/linux/centos/8.x/x86_64/pkgs/java/17.0.2/lib/libjava.so at 0x00007f06edc33000
R8 =0x0000000000004f9a is an unknown value
R9 =0x0000000000000001 is an unknown value
R10=0x00000000000000c3 is an unknown value
R11=0x00007f06e47c9840 points into unknown readable memory: 0x4141474141414143 | 43 41 41 41 41 47 41 41
R12=0x00007f06edc119e0 points into unknown readable memory: 0x0000000000000000 | 00 00 00 00 00 00 00 00
R13=0x00007f06edbe96c0 points into unknown readable memory: 0x00007f06e4f65c50 | 50 5c f6 e4 06 7f 00 00
R14=0x0000000000028318 is an unknown value
R15=0x0000000000005063 is an unknown value

Registers:
RAX=0x0000000000000000, RBX=0x00007f06edc39d00, RCX=0x0000000000028318, RDX=0x00007f06edc39d00
RSP=0x00007f06edbe6458, RBP=0x00007f06edd78f50, RSI=0x0000000000000000, RDI=0x00007f06edc39d00
R8 =0x0000000000004f9a, R9 =0x0000000000000001, R10=0x00000000000000c3, R11=0x00007f06e47c9840
R12=0x00007f06edc119e0, R13=0x00007f06edbe96c0, R14=0x0000000000028318, R15=0x0000000000005063
RIP=0x00007f06ed243291, EFLAGS=0x0000000000010206, CSGSFS=0x002b000000000033, ERR=0x0000000000000007
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f06edbe6458)
0x00007f06edbe6458:   00007f0673c89bc4 7b8f04462509c62f
0x00007f06edbe6468:   8010180048120140 0000c12912a02890
0x00007f06edbe6478:   0460229080441000 ffffffffffffffff
0x00007f06edbe6488:   4a03ed807b023001 3040120080800100

Steps to reproduce

The command ran was

gatk  HaplotypeCaller -R /rhome/luy/bigdata/genomes/Cclementina_182_v1_2.fa -I AlignedCalToCcl_Scaffolds_MarkDupOut.bam \
    -O AlignedCalToCcl_Scaffolds.vcf.gz \
    -ERC GVCF

Submitted to an HPC cluster using Slurm. Multiple machines tested, one Intel with an Xeon CPU E5-2683 v4 CPU and additionally tested on AMD with an EPYC 7713 CPU.

This has also been run multiple times, all crashing at the same __memset_avx2_erms+0x11 instruction.

Other package versions that might be relevant: java/17.0.2 glibc-common-2.28-225

If any more information is needed from me or the user, please let me know :)

lbergelson commented 1 month ago

Thank you for the detailed report. I passed this information to our partners at intel who maintain that library

Do you know if this is only happening with this particular input file? (i.e. AlignedCalToCcl_Scaffolds_MarkDupOut.bam) It's good to know that it's repeatable, but I'm wondering if your user sees the same problem when they run other inputs to the same task. We haven't seen this particular problem so I suspect it might be some confluence of factors in this file that's hitting a freshly discovered edge case.

lbergelson commented 1 month ago

Another question. Which distribution of the JVM are you running? We use and have sometimes seen issues with other distributions of Java 17.

If you could try running the same job using our standard docker environment that might provide additional information.

emwjacobson commented 1 month ago

I've asked and they seeming do have success with other files.

As for Java, we use OpenJDK downloaded from https://jdk.java.net/

openjdk version "17.0.2" 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)

Because it's a shared cluster, we aren't able to run Docker directly. But I attempted converting it in to a Singularity container and it didn't crash in the same way, but the job did end up failing.

Logs are as follows -

For the "bare metal" known-crashing conditions (AMD-based machine), the final lines of the output are:

22:47:45.999 INFO  ProgressMeter -  Scaffold_1:21181812            551.0                125350            227.5
22:47:56.192 INFO  ProgressMeter -  Scaffold_1:21203869            551.1                125450            227.6
22:48:06.937 INFO  ProgressMeter -  Scaffold_1:21251889            551.3                125650            227.9
22:48:18.177 INFO  ProgressMeter -  Scaffold_1:21271601            551.5                125750            228.0
22:48:29.896 INFO  ProgressMeter -  Scaffold_1:21281660            551.7                125810            228.0
22:48:40.223 INFO  ProgressMeter -  Scaffold_1:21284898            551.9                125830            228.0
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f889b5be310, pid=1422929, tid=1422930
#
# JRE version: OpenJDK Runtime Environment (17.0.2+8) (build 17.0.2+8-86)
# Java VM: OpenJDK 64-Bit Server VM (17.0.2+8-86, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libc.so.6+0xcf310]  __memset_avx2_unaligned_erms+0x60
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /bigdata/operations/ejaco020/gatk/core.1422929)
#
# An error report file with more information is saved as:
# /bigdata/operations/ejaco020/gatk/hs_err_pid1422929.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

When running on singularity (AMD-based machine):

07:07:35.120 INFO  ProgressMeter -  Scaffold_1:21181812            627.1                125350            199.9
07:07:45.271 INFO  ProgressMeter -  Scaffold_1:21193618            627.2                125400            199.9
07:07:56.027 INFO  ProgressMeter -  Scaffold_1:21249981            627.4                125640            200.3
07:08:07.701 INFO  ProgressMeter -  Scaffold_1:21267889            627.6                125730            200.3
07:08:19.031 INFO  ProgressMeter -  Scaffold_1:21279883            627.8                125800            200.4
07:08:32.466 INFO  ProgressMeter -  Scaffold_1:21283419            628.0                125820            200.3
Using GATK jar /gatk/gatk-package-4.6.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.6.0.0-local.jar HaplotypeCaller -R /rhome/ejaco020/bigdata/gatk/Cclementina_182_v1_2.fa -I AlignedCalToCcl_Scaffolds_MarkDupOut.bam -O sing_epyc.vcf.gz -ERC GVCF

(No further output)

Something else that I'm noticing, is that on our Intel machines, the crash happens at __memset_avx2_erms+0x11, though on AMD it crashes at __memset_avx2_unaligned_erms+0x60. Probably just an architecture thing though.

I also notice that the Singularity container uses a slightly different version of Java, 17.0.9. I'll see about getting/building a newer version of Java and attempting to run gatk and report back with any findings :)

lbergelson commented 1 month ago

Usually when we see silent failures like what's happening in singularity, it's due to an out of memory error that results in the JVM process being rudely killed before it can output an error message. It's possible that's what's happening there. If you're running in a container with a limited memory pool, you have to be sure to set the java memory explicitly with -Xmx, but also be sure to leave some memory left over for the system and for native code invoked by java. For example, if you have a container with 8G of memory available I would set -Xmx7g to leave a bit of overhead available.

I think trying with a newer release of java 17 is a good idea.

emwjacobson commented 1 month ago

Java 17.0.12 from Oracle seems to display the same behavior.

12:19:27.622 INFO  ProgressMeter -  Scaffold_1:21175995            247.8                125320            505.8
12:19:49.612 INFO  ProgressMeter -  Scaffold_1:21178224            248.1                125330            505.1
12:20:02.383 INFO  ProgressMeter -  Scaffold_1:21179909            248.4                125340            504.7
12:20:14.545 INFO  ProgressMeter -  Scaffold_1:21183582            248.6                125360            504.4
12:20:25.422 INFO  ProgressMeter -  Scaffold_1:21255583            248.7                125670            505.2
12:20:36.810 INFO  ProgressMeter -  Scaffold_1:21281660            248.9                125810            505.4
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f4ad4d94291, pid=3638446, tid=3638447
#
# JRE version: Java(TM) SE Runtime Environment (17.0.12+8) (build 17.0.12+8-LTS-286)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.12+8-LTS-286, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libc.so.6+0xcf291]  __memset_avx2_erms+0x11
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /bigdata/operations/ejaco020/gatk/core.3638446)
#
# An error report file with more information is saved as:
# /bigdata/operations/ejaco020/gatk/hs_err_pid3638446.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

I also attempted running within a singularity container, allocating 64GB of memory to the job and specifying -Xmx60G. Still seemed to silently "crash".

Command I ran was:

singularity run gatk_4.6.0.0.sif gatk HaplotypeCaller --java-options -Xmx60G -R /rhome/ejaco020/bigdata/gatk/Cclementina_182_v1_2.fa -I AlignedCalToCcl_Scaffolds_MarkDupOut.bam \    
    -O sing.vcf.gz \    
    -ERC GVCF
lbergelson commented 1 month ago

@emwjacobson I'm sorry I don't have a better solution. It seems likely it's some sort of input data specific crash in the GKL. If the user encounters the problem at the same point in the file everytime I would recommend that they work around the crash by excluding the approximate location by using the -XL argument.

I've reported the issue to our collaborators at Intel but there are currently some structural changes happening on their end so it might not be resolved quickly.

They could run the problematic segment using the native java HMM -pairHMM LOGLESS_CACHING and then combine that segment back into the other calls.

It's also possible that the non-OMP version of the hmm might not hit the same issue. They could try with pairHMM AVX_LOGLESS_CACHING set instead of the default ``