GenomicsDB malloc unaligned tcache chunk error

danagibbon commented 10 months ago

Bug Report

Affected tool(s) or class(es)

Tool/class name(s), special parameters: GenomicsDBImport

Affected version(s)

Version: gatk4-4.4.0.0-0

Description

Hello,

I keep running into an issue with GenomicsDBImport. It has occurred across a range of sample and shard counts (8 to 1000 samples, shard counts of up to 2000). My current example is an attempt to joint-call 1000 samples together. When I submit the jobs, 1-2 of the ~100 concurrently running shards fail with malloc(): unaligned tcache chunk detected. When I resubmit a failed shard, it usually reruns without a problem. If I instead kill all jobs and resubmit, a different shard throws the malloc error.

I have run approximately 20 tests and I seem to get this failure 2/3 times. However, it only arises on the initial submission and not when additional jobs are submitted as previous shards complete. Please note that the 1000 samples have successfully been imported into the GenomicsDB but this error seems to persist somewhat randomly across multiple machines.
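
For anyone who wants to automate the resubmission described above, here is a minimal sketch of a retry wrapper. It is a stopgap under stated assumptions, not a fix for the underlying crash. The names run_with_retry, looks_like_native_crash, and the cleanup hook are hypothetical; the two crash signatures are copied from the logs in this report. It also assumes that a failed shard may leave a partially written workspace that has to be cleared before the retry, which is why a cleanup callback is accepted.

```python
import subprocess
import time

# Crash signatures copied from the logs in this report.
CRASH_MARKERS = (
    "malloc(): unaligned tcache chunk detected",
    "A fatal error has been detected by the Java Runtime Environment",
)


def looks_like_native_crash(output):
    """Return True if the captured output contains a known crash signature."""
    return any(marker in output for marker in CRASH_MARKERS)


def run_with_retry(cmd, max_attempts=3, delay_s=0.0, cleanup=None,
                   runner=subprocess.run):
    """Run cmd, retrying only when the output looks like the native crash.

    cleanup is an optional, hypothetical hook called before each retry
    (for example, to remove a partially written workspace directory).
    Returns a (returncode, attempts_used) tuple.
    """
    returncode = None
    for attempt in range(1, max_attempts + 1):
        proc = runner(cmd, capture_output=True, text=True)
        returncode = proc.returncode
        output = (proc.stdout or "") + (proc.stderr or "")
        # Success, or a failure that is not the crash from this report:
        # return immediately so genuine errors are not masked by retries.
        if returncode == 0 or not looks_like_native_crash(output):
            return returncode, attempt
        if attempt < max_attempts:
            if cleanup is not None:
                cleanup()
            time.sleep(delay_s)
    return returncode, max_attempts
```

Retrying only on the known signatures is deliberate: a shard that fails for any other reason (bad input, out of memory) is reported on the first attempt instead of being rerun.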

Thank you for your assistance!

Steps to reproduce

Command used (omitting paths to 1000 samples for brevity) for one of the failed shards.

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx8g -jar  /gpfs/gpfs_de6000/home/dalegre/miniconda3/envs/GOASTv4.0/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar GenomicsDBImport -V [samples 1-1002]  --genomicsdb-workspace-path results/jointcalling/genomicsDB/temp_0882_of_2000_DB --merge-input-intervals false --bypass-feature-reader --tmp-dir temp --max-num-intervals-to-import-in-parallel 10 --batch-size 50 --intervals results/germline/interval/temp_0882_of_2000/scattered.interval_list --genomicsdb-shared-posixfs-optimizations true

Expected behavior

All shards are imported into the GenomicsDB successfully.

Actual behavior

The job dies with this error:

malloc(): unaligned tcache chunk detected

23:45:26.793 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/gpfs_de6000/home/dalegre/miniconda3/envs/GOASTv4.0/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:45:26.822 INFO  GenomicsDBImport - ------------------------------------------------------------
23:45:26.824 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.4.0.0
23:45:26.824 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
23:45:26.824 INFO  GenomicsDBImport - Executing as dalegre@amd4103.hpc.eu.lenovo.com on Linux v5.14.0-284.11.1.el9_2.x86_64 amd64
23:45:26.824 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
23:45:26.824 INFO  GenomicsDBImport - Start Date/Time: February 6, 2024 at 11:45:26 PM CET
23:45:26.824 INFO  GenomicsDBImport - ------------------------------------------------------------
23:45:26.824 INFO  GenomicsDBImport - ------------------------------------------------------------
23:45:26.825 INFO  GenomicsDBImport - HTSJDK Version: 3.0.5
23:45:26.825 INFO  GenomicsDBImport - Picard Version: 3.0.0
23:45:26.825 INFO  GenomicsDBImport - Built for Spark Version: 3.3.1
23:45:26.826 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:45:26.826 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:45:26.826 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:45:26.826 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:45:26.826 INFO  GenomicsDBImport - Deflater: IntelDeflater
23:45:26.827 INFO  GenomicsDBImport - Inflater: IntelInflater
23:45:26.827 INFO  GenomicsDBImport - GCS max retries/reopens: 20
23:45:26.827 INFO  GenomicsDBImport - Requester pays: disabled
23:45:26.827 INFO  GenomicsDBImport - Initializing engine
23:45:46.550 INFO  FeatureManager - Using codec IntervalListCodec to read file file:///gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.4/results/germline/interval/temp_0882_of_2000/scattered.interval_list
23:45:46.584 INFO  IntervalArgumentCollection - Processing 1086188 bp from intervals
23:45:46.586 INFO  GenomicsDBImport - Done initializing engine
23:45:47.489 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.4-ce4e1b9
23:45:47.491 INFO  GenomicsDBImport - Vid Map JSON file will be written to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.4/results/jointcalling/genomicsDB/temp_0882_of_2000_DB/vidmap.json
23:45:47.491 INFO  GenomicsDBImport - Callset Map JSON file will be written to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.4/results/jointcalling/genomicsDB/temp_0882_of_2000_DB/callset.json
23:45:47.491 INFO  GenomicsDBImport - Complete VCF Header will be written to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.4/results/jointcalling/genomicsDB/temp_0882_of_2000_DB/vcfheader.vcf
23:45:47.491 INFO  GenomicsDBImport - Importing to workspace - /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.4/results/jointcalling/genomicsDB/temp_0882_of_2000_DB
malloc(): unaligned tcache chunk detected

lbergelson commented 10 months ago

@nalinigans Any thoughts on this?

nalinigans commented 10 months ago

It almost looks like there is a buffer overrun somewhere. Most of our testing has been on NFS, and we have not encountered a tcache (thread-local cache) issue there. Is GPFS available as open source?

danagibbon commented 9 months ago

If it helps, I have seen this error when using local drives exclusively (not attached to a shared file system).

Twice it has manifested as a core dump that points to C [libc.so.6+0xaf4f9] malloc+0x169:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000014cfb1d504f9, pid=1182729, tid=1195264
#
# JRE version: OpenJDK Runtime Environment (17.0.3) (build 17.0.3-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (17.0.3-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libc.so.6+0xaf4f9]  malloc+0x169
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.3/core.1182729)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
[dalegre@login4601 fdone]$ head -n 20 hs_err_pid1182729.log
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000014cfb1d504f9, pid=1182729, tid=1195264
#
# JRE version: OpenJDK Runtime Environment (17.0.3) (build 17.0.3-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (17.0.3-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libc.so.6+0xaf4f9]  malloc+0x169
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /gpfs/gpfs_de6000/home/dalegre/projects/1000-Genomes/jointcalling-test/goast_workflows/JointCalling/test_samples-1000.1.3/core.1182729)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

nalinigans commented 9 months ago

@danagibbon thanks for this pointer. What versions of gatk have you seen this error on?

danagibbon commented 9 months ago

@nalinigans thank you for the prompt replies! I'm using gatk4-4.4.0.0-0

I will try the latest version next week when our cluster is back online (currently undergoing scheduled maintenance).

nalinigans commented 9 months ago

Thanks @danagibbon, I may know what the issue is. HDFS support in GenomicsDB still relies on JVM/Java 11, and we put in some workarounds involving thread-local caches a while ago. Sometime next week I will create a branch without HDFS support, which will hopefully get us past this issue.

danagibbon commented 9 months ago

Thank you, much appreciated!!! Have a nice weekend.

nalinigans commented 9 months ago

@danagibbon, here is the branch - https://github.com/broadinstitute/gatk/tree/ng_remove_hdfs_support. Can you build GATK from this branch and try it out, please? If the problem still exists, can you attach the core dump file too? Thanks.
