Open emwjacobson opened 1 month ago
Thank you for the detailed report. I passed this information to our partners at Intel, who maintain that library.
Do you know if this is only happening with this particular input file (i.e. AlignedCalToCcl_Scaffolds_MarkDupOut.bam)? It's good to know that it's repeatable, but I'm wondering if your user sees the same problem when they run other inputs to the same task. We haven't seen this particular problem, so I suspect it might be some confluence of factors in this file that's hitting a freshly discovered edge case.
Another question. Which distribution of the JVM are you running? We use and have sometimes seen issues with other distributions of Java 17.
If you could try running the same job using our standard docker environment that might provide additional information.
I've asked, and they seemingly do have success with other files.
As for Java, we use OpenJDK downloaded from https://jdk.java.net/
openjdk version "17.0.2" 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)
Because it's a shared cluster, we aren't able to run Docker directly. I attempted converting it into a Singularity container; it didn't crash in the same way, but the job did end up failing.
Logs are as follows -
For the "bare metal" known-crashing conditions (AMD-based machine), the final lines of the output are:
22:47:45.999 INFO ProgressMeter - Scaffold_1:21181812 551.0 125350 227.5
22:47:56.192 INFO ProgressMeter - Scaffold_1:21203869 551.1 125450 227.6
22:48:06.937 INFO ProgressMeter - Scaffold_1:21251889 551.3 125650 227.9
22:48:18.177 INFO ProgressMeter - Scaffold_1:21271601 551.5 125750 228.0
22:48:29.896 INFO ProgressMeter - Scaffold_1:21281660 551.7 125810 228.0
22:48:40.223 INFO ProgressMeter - Scaffold_1:21284898 551.9 125830 228.0
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f889b5be310, pid=1422929, tid=1422930
#
# JRE version: OpenJDK Runtime Environment (17.0.2+8) (build 17.0.2+8-86)
# Java VM: OpenJDK 64-Bit Server VM (17.0.2+8-86, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libc.so.6+0xcf310] __memset_avx2_unaligned_erms+0x60
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /bigdata/operations/ejaco020/gatk/core.1422929)
#
# An error report file with more information is saved as:
# /bigdata/operations/ejaco020/gatk/hs_err_pid1422929.log
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
When running on singularity (AMD-based machine):
07:07:35.120 INFO ProgressMeter - Scaffold_1:21181812 627.1 125350 199.9
07:07:45.271 INFO ProgressMeter - Scaffold_1:21193618 627.2 125400 199.9
07:07:56.027 INFO ProgressMeter - Scaffold_1:21249981 627.4 125640 200.3
07:08:07.701 INFO ProgressMeter - Scaffold_1:21267889 627.6 125730 200.3
07:08:19.031 INFO ProgressMeter - Scaffold_1:21279883 627.8 125800 200.4
07:08:32.466 INFO ProgressMeter - Scaffold_1:21283419 628.0 125820 200.3
Using GATK jar /gatk/gatk-package-4.6.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.6.0.0-local.jar HaplotypeCaller -R /rhome/ejaco020/bigdata/gatk/Cclementina_182_v1_2.fa -I AlignedCalToCcl_Scaffolds_MarkDupOut.bam -O sing_epyc.vcf.gz -ERC GVCF
(No further output)
Something else I'm noticing is that on our Intel machines the crash happens at __memset_avx2_erms+0x11, though on AMD it crashes at __memset_avx2_unaligned_erms+0x60. Probably just an architecture thing though.
I also notice that the Singularity container uses a slightly different version of Java, 17.0.9. I'll see about getting/building a newer version of Java and attempting to run gatk and report back with any findings :)
Usually when we see silent failures like what's happening in singularity, it's due to an out of memory error that results in the JVM process being rudely killed before it can output an error message. It's possible that's what's happening there. If you're running in a container with a limited memory pool, you have to be sure to set the java memory explicitly with -Xmx, but also be sure to leave some memory left over for the system and for native code invoked by java. For example, if you have a container with 8G of memory available I would set -Xmx7g to leave a bit of overhead available.
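The headroom rule above can be sketched as a small shell helper. This is only an illustration: `xmx_flag` is a hypothetical name, and the overhead value is an assumption to tune per job, not a GATK recommendation.

```shell
# Hypothetical helper: derive a JVM heap flag from a memory limit in GB,
# leaving some headroom for the OS and for native code called from Java.
# The body runs in a subshell so its variables do not leak to the caller.
xmx_flag() (
  total_gb=$1
  overhead_gb=${2:-1}   # assumed default headroom of 1 GB
  if [ "$total_gb" -le "$overhead_gb" ]; then
    echo "memory limit must exceed the overhead" >&2
    exit 1
  fi
  echo "-Xmx$(( total_gb - overhead_gb ))g"
)

# An 8 GB container with 1 GB of headroom, as in the example above.
echo "--java-options $(xmx_flag 8 1)"
```

The printed flag can then be passed to the `gatk` wrapper through `--java-options`.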
I think trying with a newer release of java 17 is a good idea.
Java 17.0.12 from Oracle seems to display the same behavior.
12:19:27.622 INFO ProgressMeter - Scaffold_1:21175995 247.8 125320 505.8
12:19:49.612 INFO ProgressMeter - Scaffold_1:21178224 248.1 125330 505.1
12:20:02.383 INFO ProgressMeter - Scaffold_1:21179909 248.4 125340 504.7
12:20:14.545 INFO ProgressMeter - Scaffold_1:21183582 248.6 125360 504.4
12:20:25.422 INFO ProgressMeter - Scaffold_1:21255583 248.7 125670 505.2
12:20:36.810 INFO ProgressMeter - Scaffold_1:21281660 248.9 125810 505.4
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f4ad4d94291, pid=3638446, tid=3638447
#
# JRE version: Java(TM) SE Runtime Environment (17.0.12+8) (build 17.0.12+8-LTS-286)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.12+8-LTS-286, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libc.so.6+0xcf291] __memset_avx2_erms+0x11
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /bigdata/operations/ejaco020/gatk/core.3638446)
#
# An error report file with more information is saved as:
# /bigdata/operations/ejaco020/gatk/hs_err_pid3638446.log
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
I also attempted running within a singularity container, allocating 64GB of memory to the job and specifying -Xmx60G. Still seemed to silently "crash".
Command I ran was:
singularity run gatk_4.6.0.0.sif gatk HaplotypeCaller --java-options -Xmx60G -R /rhome/ejaco020/bigdata/gatk/Cclementina_182_v1_2.fa -I AlignedCalToCcl_Scaffolds_MarkDupOut.bam \
-O sing.vcf.gz \
-ERC GVCF
@emwjacobson I'm sorry I don't have a better solution. It seems likely it's some sort of input-data-specific crash in the GKL. If the user encounters the problem at the same point in the file every time, I would recommend that they work around the crash by excluding the approximate location using the -XL argument.
I've reported the issue to our collaborators at Intel but there are currently some structural changes happening on their end so it might not be resolved quickly.
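A minimal sketch of the -XL workaround, reusing the paths from the command posted earlier in this thread. The exclusion window and the output file name are assumptions: the window is a guess based on the last ProgressMeter positions in the logs, and it should be adjusted to wherever the crash actually occurs.

```shell
# Assumed exclusion window around the last logged positions on Scaffold_1.
EXCLUDE="Scaffold_1:21280000-21290000"

# Hypothetical output name; the other arguments come from the thread.
CMD="gatk HaplotypeCaller \
-R /rhome/ejaco020/bigdata/gatk/Cclementina_182_v1_2.fa \
-I AlignedCalToCcl_Scaffolds_MarkDupOut.bam \
-O excluded.g.vcf.gz \
-ERC GVCF \
-XL $EXCLUDE"

# Print the command for review before submitting it to the cluster.
echo "$CMD"
```

Printing rather than executing keeps the sketch safe to paste; the job script would run the same arguments directly.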
They could run the problematic segment using the native Java HMM (-pairHMM LOGLESS_CACHING) and then combine that segment back into the other calls.
It's also possible that the non-OMP version of the HMM might not hit the same issue. They could try with -pairHMM AVX_LOGLESS_CACHING set instead of the default ``
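The two suggestions above could be scripted roughly as follows. The helper name `hc_segment_cmd`, the interval, and the output file names are hypothetical. Merging with `gatk MergeVcfs` is one possible way to combine the pieces; it is an assumption, since the thread does not say which tool to use.

```shell
# Hypothetical helper: print a HaplotypeCaller command restricted to one
# interval, with an explicitly chosen PairHMM implementation.
#   $1 = interval, $2 = PairHMM implementation, $3 = output GVCF
hc_segment_cmd() {
  echo "gatk HaplotypeCaller" \
    "-R /rhome/ejaco020/bigdata/gatk/Cclementina_182_v1_2.fa" \
    "-I AlignedCalToCcl_Scaffolds_MarkDupOut.bam" \
    "-O $3" \
    "-ERC GVCF" \
    "-L $1" \
    "-pairHMM $2"
}

# Assumed problem window; it should match whatever was excluded with -XL.
SEGMENT="Scaffold_1:21280000-21290000"

# Re-run only the problem segment with the pure-Java HMM.
hc_segment_cmd "$SEGMENT" LOGLESS_CACHING segment.g.vcf.gz

# Alternative: try the non-OMP AVX implementation on the same segment.
hc_segment_cmd "$SEGMENT" AVX_LOGLESS_CACHING segment_avx.g.vcf.gz

# Assumed merge step: combine the segment with the calls made elsewhere.
echo "gatk MergeVcfs -I excluded.g.vcf.gz -I segment.g.vcf.gz -O merged.g.vcf.gz"
```

The commands are only printed, so the interval and file names can be checked before anything is submitted.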
* Opening on behalf of a user on an HPC cluster, my knowledge in this field is a bit limited.
Affected tool(s) or class(es)
gatk HaplotypeCaller
Affected version(s)
Latest 4.6.0.0 release
Description
When running the command, the program crashes about 16 hours into the run. Below is the start of the Java error report file
Steps to reproduce
The command that was run was
Submitted to an HPC cluster using Slurm. Multiple machines were tested: one Intel machine with a Xeon E5-2683 v4 CPU, and additionally an AMD machine with an EPYC 7713 CPU.
This has also been run multiple times, all crashing at the same __memset_avx2_erms+0x11 instruction.
Other package versions that might be relevant: java/17.0.2 glibc-common-2.28-225
If any more information is needed from me or the user, please let me know :)