Open danagibbon opened 10 months ago
@nalinigans Any thoughts on this?
It almost looks like there is a buffer overrun somewhere. Most of our testing has been on nfs, and we have not encountered a tcache (thread-local cache) issue. Is gpfs available as open source?
If it helps, I have seen this error when using local drives exclusively (not attached to a shared file system).
Twice it has manifested as a core dump that points to C [libc.so.6+0xaf4f9] malloc+0x169:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x000014cfb1d504f9, pid=1182729, tid=1195264
#
# JRE version: OpenJDK Runtime Environment (17.0.3) (build 17.0.3-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (17.0.3-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libc.so.6+0xaf4f9] malloc+0x169
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.3/core.1182729)
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
[dalegre@login4601 fdone]$ head -n 20 hs_err_pid1182729.log
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x000014cfb1d504f9, pid=1182729, tid=1195264
#
# JRE version: OpenJDK Runtime Environment (17.0.3) (build 17.0.3-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (17.0.3-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libc.so.6+0xaf4f9] malloc+0x169
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.3/core.1182729)
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
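For anyone collecting several of these crash logs across shards, here is a minimal Python sketch that pulls the signal, pid, tid, and problematic frame out of an hs_err header like the one above. The function name parse_hs_err_header and the SAMPLE string are made up for illustration; the regular expressions only assume the header layout shown in this log and may need adjusting for other JVM versions.

```python
import re

# Header lines copied from the hs_err log above (shortened to the relevant ones).
SAMPLE = """#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000014cfb1d504f9, pid=1182729, tid=1195264
#
# Problematic frame:
# C [libc.so.6+0xaf4f9] malloc+0x169
#
"""

SIGNAL_RE = re.compile(
    r"^#[ \t]+(SIG\w+) \((0x[0-9a-fA-F]+)\) at pc=(0x[0-9a-fA-F]+), "
    r"pid=(\d+), tid=(\d+)",
    re.M,
)
FRAME_RE = re.compile(
    r"^# Problematic frame:[ \t]*\n"
    r"#[ \t]+(\w)[ \t]+\[(.+?)\+(0x[0-9a-fA-F]+)\][ \t]+(\S+)",
    re.M,
)


def parse_hs_err_header(text):
    """Return a dict with the crash signal and problematic frame, if present."""
    info = {}
    sig = SIGNAL_RE.search(text)
    if sig:
        info["signal"] = sig.group(1)
        info["pc"] = sig.group(3)
        info["pid"] = int(sig.group(4))
        info["tid"] = int(sig.group(5))
    frame = FRAME_RE.search(text)
    if frame:
        # Frame type "C" marks native code in hs_err logs.
        info["native"] = frame.group(1) == "C"
        info["library"] = frame.group(2)
        info["offset"] = frame.group(3)
        info["symbol"] = frame.group(4)
    return info


if __name__ == "__main__":
    print(parse_hs_err_header(SAMPLE))
```

Grouping the parsed results by library and symbol is a quick way to check whether every failing shard dies in the same malloc frame.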
@danagibbon thanks for this pointer. What versions of gatk have you seen this error on?
@nalinigans thank you for the prompt replies! I'm using gatk4-4.4.0.0-0
I will try the latest version next week when our cluster is back online (currently undergoing scheduled maintenance).
Thanks @danagibbon, I may know what the issue is. hdfs support in GenomicsDB still relies on JVM/Java 11, and we had some workarounds with thread local caches from a while ago. I will create a branch sometime next week without hdfs, which will hopefully get us past this issue.
Thank you, much appreciated!!! Have a nice weekend.
@danagibbon, here is the branch - https://github.com/broadinstitute/gatk/tree/ng_remove_hdfs_support. Can you build gatk from this branch and try it out, please? If the problem still exists, can you attach the core dump file too? Thanks.
Bug Report
Affected tool(s) or class(es)
Affected version(s)
Description
Hello,
I have been having an issue come up when utilizing GenomicsDBImport. This issue has happened when using a range of samples and shard counts (8 - 1000 samples, shard count of up to 2000). My current example is an attempt to joint call 1000 samples together. I will submit the jobs and 1-2 of the shards (of the ~100 concurrently running) will throw a malloc(): unaligned tcache chunk detected error. When I resubmit that shard, it will usually rerun without a problem. Or if I kill all jobs and resubmit, a different shard will throw the malloc error. I have run approximately 20 tests and I seem to get this failure 2/3 times. However, it only arises on the initial submission and not when additional jobs are submitted as previous shards complete. Please note that the 1000 samples have successfully been imported into the GenomicsDB, but this error seems to persist somewhat randomly across multiple machines.
Thank you for your assistance!
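Since a resubmitted shard usually succeeds, a stopgap while the root cause is investigated could be a small retry wrapper around each shard job. The Python sketch below is only an illustration: run_with_retry and run_shard_command are hypothetical names, the command passed in is whatever your workflow already uses to launch a shard, and cleaning up a partially written workspace before a retry is left to the caller. It retries only when stderr contains the tcache message, so unrelated failures still surface immediately. It hides the symptom and does not fix the underlying memory corruption.

```python
import subprocess

TCACHE_ERROR = "malloc(): unaligned tcache chunk detected"


def run_shard_command(cmd):
    """Run one shard command (a list of arguments) and return (returncode, stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr


def run_with_retry(run_once, max_attempts=3):
    """Call run_once() until it succeeds; retry only on the tcache error.

    run_once is any zero-argument callable returning (returncode, stderr_text).
    Returns the attempt number that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        returncode, stderr_text = run_once()
        if returncode == 0:
            return attempt
        if TCACHE_ERROR not in stderr_text:
            # A different failure: do not mask it by retrying.
            raise RuntimeError(f"shard failed with an unrelated error: {stderr_text!r}")
    raise RuntimeError(f"shard still failing after {max_attempts} attempts")
```

A call would look like run_with_retry(lambda: run_shard_command(my_shard_cmd)), where my_shard_cmd is the argument list for one shard's import job.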
Steps to reproduce
Expected behavior
All shards are imported into the GenomicsDB successfully.
Actual behavior
job dies with this error:
malloc(): unaligned tcache chunk detected