SWIFTSIM / HBTplus

HBTplus halo finder adapted for the FLAMINGO and COLIBRE simulations

Crash when running with the address sanitizer #36

Open jchelly opened 2 months ago

jchelly commented 2 months ago

When running with the gcc address sanitizer enabled, using gcc 13 on Cosma, the code crashes in various ways. To reproduce:

git clone git@github.com:SWIFTSIM/HBTplus.git
cd HBTplus/testing
git checkout remove_duplicate_particles_in_mergers

Note that master currently fails with the sanitizer due to a known bug, so we check out this fixed branch. Edit compile.sh and add

-DCMAKE_CXX_FLAGS_DEBUG="-g -O3 -fsanitize=address"

to the cmake command then run

bash ./run_test.sh

This will submit a batch job. Keep an eye on test_output/logs/error.err and observe that the code (probably) crashes somewhere around snapshots 6-10 with either a sanitizer error or a failed assert, or possibly both if more than one process crashes. The crash usually happens in the merger tree function call that finds descendants, but so far I haven't found any problems there. If I put in asserts to bounds check every single array access in that part of the code, the asserts only fail if the sanitizer is enabled.
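Since the behaviour differs between sanitized and unsanitized builds, it may help to have each process record which kind of build it is. A minimal sketch, assuming nothing about the HBTplus source: the names `BUILT_WITH_ASAN`, `built_with_asan()` and `build_description()` are hypothetical. It relies on gcc defining `__SANITIZE_ADDRESS__` under `-fsanitize=address` and on clang's `__has_feature(address_sanitizer)`.

```cpp
#include <cassert>
#include <string>

// __has_feature is a clang extension; provide a fallback so the check
// below also compiles with compilers that do not have it.
#ifndef __has_feature
#define __has_feature(x) 0
#endif

// gcc defines __SANITIZE_ADDRESS__ when -fsanitize=address is used;
// clang reports the sanitizer through __has_feature(address_sanitizer).
#if defined(__SANITIZE_ADDRESS__) || __has_feature(address_sanitizer)
#define BUILT_WITH_ASAN 1
#else
#define BUILT_WITH_ASAN 0
#endif

// Hypothetical helper (not part of HBTplus): true if this translation
// unit was compiled with the address sanitizer enabled.
constexpr bool built_with_asan() { return BUILT_WITH_ASAN != 0; }

// Hypothetical helper: a one-line description suitable for a log file.
std::string build_description()
{
  return built_with_asan() ? "built with AddressSanitizer"
                           : "built without AddressSanitizer";
}
```

Writing `build_description()` to the log at startup would make it unambiguous which configuration produced a given error.err.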

jchelly commented 2 months ago

The instructions above resulted in a crash for me this morning using the current latest commit 478d27dfb80d6c8d6f6bc3e9ed7d788b0ee6251b from the remove_duplicate_particles_in_mergers branch.

jchelly commented 2 months ago

I also get a crash if I check out commit 065d1f5, which is before the merger tree code PR was merged in. This segfaults in the unbinding code, but only if the sanitizer is enabled.

VictorForouhar commented 2 months ago

Does it also segfault before we added any of our own changes?

MatthieuSchaller commented 2 months ago

If it's a problem in the assertions, playing with #define NDEBUG could help.

jchelly commented 2 months ago

The odd thing is that there are assertions which pass when the sanitizer is not enabled but fail when it is. The asserts in question are mostly array bounds checks so I don't think we want to disable them with NDEBUG.
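For reference, one way to keep bounds checks active regardless of NDEBUG is to use an explicit check instead of assert(). This is only a sketch: `checked_at` is a hypothetical helper, not something in HBTplus, and `std::vector::at` already offers similar behaviour with a less descriptive message.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of HBTplus): bounds-checked element access
// that stays active even when NDEBUG is defined, unlike assert().
template <typename T>
const T &checked_at(const std::vector<T> &v, std::size_t i, const char *what)
{
  // assert() is compiled out under NDEBUG, so it is used here only for a
  // cheap precondition on the label, not for the bounds check itself.
  assert(what != nullptr);

  if (i >= v.size())
  {
    std::ostringstream msg;
    msg << what << ": index " << i << " out of range (size " << v.size() << ")";
    throw std::out_of_range(msg.str());
  }
  return v[i];
}
```

The label argument makes the error message say which array was being accessed, which is useful when several MPI ranks write to the same error log.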

jchelly commented 2 months ago

If I configure using llvm by putting this in the compile.sh script

module purge
module load gnu_comp/13.1.0 hdf5/1.12.2 openmpi/4.1.4 cmake/3.28.3 llvm/17.0.6

export CC=clang
export CXX=clang++
export OMPI_CC=clang
export OMPI_CXX=clang++

and enable the address sanitizer then it doesn't even start up! I get

AddressSanitizer:DEADLYSIGNAL
=================================================================
AddressSanitizer:DEADLYSIGNAL
=================================================================
==266915==ERROR: AddressSanitizer: SEGV on unknown address 0x000559664a55 (pc 0x000559664a55 bp 0x00000000000c sp 0x7ffd582ae708 T0)
==266914==ERROR: AddressSanitizer: SEGV on unknown address 0x00055708b1dc (pc 0x00055708b1dc bp 0x00000000000c sp 0x7ffc8c745a58 T0)
==266914==The signal is caused by a READ memory access.
==266915==The signal is caused by a READ memory access.
==266915==Hint: PC is at a non-executable region. Maybe a wild jump?
==266914==Hint: PC is at a non-executable region. Maybe a wild jump?

If I use the same config without the sanitizer it runs and hasn't crashed yet.

jchelly commented 1 month ago

I haven't been able to reproduce this crash on Ubuntu 24.04 with the latest gcc and address sanitizer enabled. Not sure if that means anything.

jchelly commented 1 month ago

It also doesn't crash with clang-14 or clang-18 on Ubuntu.