arbor-sim / arbor

The Arbor multi-compartment neural network simulation library.
https://arbor-sim.org
BSD 3-Clause "New" or "Revised" License

Busyring crashes in Nernst on ARM (GH200 CPU) with non-power-of-two thread counts #2284

Open thorstenhater opened 2 weeks ago

thorstenhater commented 2 weeks ago

The crash sometimes masquerades as an MPI crash.

Example of the crash surfacing as an MPI (UCX) failure, with 17 threads:

$ srun --exclusive -A zam -N 1 -n 1 --cpus-per-gpu=17 --gpus=1 --gpus-per-task=1 --gres=gpu:1 bin/busyring input.json
gpu:      yes
threads:  17
mpi:      yes
ranks:    1

start=1720081941
cell stats: 2048 cells; 303110 branches; 2831618 compartments;
#cpu=2048 #gpu=0
#cell=2048 #local=2048 #groups=17
model-init=1720081945
running simulation

  0% |                                                  |             0ms[1720081945.285889] [jpbot-001-20:559978:0]
        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[jpbot-001-20:559978] *** Process received signal ***
[jpbot-001-20:559978] Signal: Segmentation fault (11)
[jpbot-001-20:559978] Signal code: Address not mapped (1)
[jpbot-001-20:559978] Failing at address: 0x103bcae285ed0
[jpbot-001-20:559978:0:560001] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bd4e742890)
[jpbot-001-20:559978:1:559978] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3c0310f1d70)
[1720081945.285889] [jpbot-001-20:559978:1]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285927] [jpbot-001-20:559978:2]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285934] [jpbot-001-20:559978:0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285941] [jpbot-001-20:559978:1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285931] [jpbot-001-20:559978:3]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285954] [jpbot-001-20:559978:4]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285959] [jpbot-001-20:559978:5]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285953] [jpbot-001-20:559978:2]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285971] [jpbot-001-20:559978:5]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285964] [jpbot-001-20:559978:6]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[jpbot-001-20:559978:2:559992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bd031d9a50)
[jpbot-001-20:559978:5:559991] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bd099f5b90)
[jpbot-001-20:559978:3:559987] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bc6491bf30)
[jpbot-001-20:559978:6:559988] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bc9653bd90)
[jpbot-001-20:559978:4:560002] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bc925f2e20)
[jpbot-001-20:559978] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffbb9e07f0]
[jpbot-001-20:559978] [ 1] bin/busyring[0x4be968]
[jpbot-001-20:559978] [ 2] bin/busyring[0x4dbbb0]
[jpbot-001-20:559978] [ 3] bin/busyring[0x4fabb0]
[jpbot-001-20:559978] [ 4] bin/busyring[0x460bf0]
[jpbot-001-20:559978] [ 5] bin/busyring[0x467d44]
[jpbot-001-20:559978] [ 6] /p/software/jedi/stages/2024/software/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xd693c)[0xffffbb6a693c]
[jpbot-001-20:559978] [ 7] /lib64/libc.so.6(+0x80698)[0xffffbb390698]
[jpbot-001-20:559978] [ 8] /lib64/libc.so.6(+0xeabdc)[0xffffbb3fabdc]
[jpbot-001-20:559978] *** End of error message ***
srun: error: jpbot-001-20: task 0: Segmentation fault (core dumped)

Same test case with a different number of CPUs per GPU (16 threads instead of 17), which runs to completion:

$ srun --exclusive -A zam -N 1 -n 1 --cpus-per-gpu=16 --gpus=1 --gpus-per-task=1 --gres=gpu:1 bin/busyring input.json
gpu:      yes
threads:  16
mpi:      yes
ranks:    1

start=1720081984
cell stats: 2048 cells; 303110 branches; 2831618 compartments;
#cpu=2048 #gpu=0
#cell=2048 #local=2048 #groups=16
model-init=1720081988
running simulation

100% |--------------------------------------------------|            25ms
model-run=1720082064

2563 spikes generated at rate of 0.00975419 ms between spikes

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                      3.433        1826.736
model-run                      76.748          54.316
meter-total                    80.181        1881.051
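
The two runs above differ only in `--cpus-per-gpu` (17 crashes, 16 completes). To check whether the failure really tracks the thread count, a sweep over that flag might help. The following is a minimal sketch that only builds and prints the `srun` command lines; the helper names `is_power_of_two` and `busyring_command` and the range 14..20 are hypothetical, and every other flag is copied from the commands in this report.

```python
import shlex


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; uses the single-set-bit trick."""
    return n > 0 and (n & (n - 1)) == 0


def busyring_command(threads: int) -> list[str]:
    """Build the srun invocation from this report, varying only the thread count."""
    return [
        "srun", "--exclusive", "-A", "zam", "-N", "1", "-n", "1",
        f"--cpus-per-gpu={threads}", "--gpus=1", "--gpus-per-task=1",
        "--gres=gpu:1", "bin/busyring", "input.json",
    ]


if __name__ == "__main__":
    # Hypothetical sweep range around the two reported values (16 and 17).
    for threads in range(14, 21):
        tag = "pow2" if is_power_of_two(threads) else "non-pow2"
        # Dry run: print the command instead of executing it.
        print(f"{tag:9s} {shlex.join(busyring_command(threads))}")
```

Running each printed command and recording which ones segfault would show whether only non-power-of-two counts fail, or whether the pattern is something else (for example odd counts only).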

A different stack trace, this time pointing at Arbor itself:

Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1154ccb34d740)
==== backtrace (tid: 277735) ====
 0 0x00000000004be968 arb::default_catalogue::kernel_nernst::compute_currents()  ???:0
 1 0x00000000004dbbb0 arb::fvm_lowered_cell_impl<arb::multicore::backend>::integrate()  ???:0
 2 0x00000000004fabb0 arb::cable_cell_group::advance()  ???:0
 3 0x0000000000460bf0 std::_Function_handler<void (), arb::threading::task_group::wrap<arb::threading::parallel_for::apply<arb::simulation_state::foreach_group_index<arb::simulation_state::run(double, double)::{lambda(arb::epoch)#2}::operator()(arb::epoch) const::{lambda(std::unique_ptr<arb::cell_group, std::default_delete<arb::cell_group> >&, int)#1}>(arb::simulation_state::run(double, double)::{lambda(arb::epoch)#2}::operator()(arb::epoch) const::{lambda(std::unique_ptr<arb::cell_group, std::default_delete<arb::cell_group> >&, int)#1}&&)::{lambda(int)#1}>(int, int, int, arb::threading::task_system*, arb::simulation_state::foreach_group_index<arb::simulation_state::run(double, double)::{lambda(arb::epoch)#2}::operator()(arb::epoch) const::{lambda(std::unique_ptr<arb::cell_group, std::default_delete<arb::cell_group> >&, int)#1}>(arb::simulation_state::run(double, double)::{lambda(arb::epoch)#2}::operator()(arb::epoch) const::{lambda(std::unique_ptr<arb::cell_group, std::default_delete<arb::cell_group> >&, int)#1}&&)::{lambda(int)#1})::{lambda()#1}> >::_M_invoke()  ???:0
 4 0x00000000004595b4 arb::threading::task_group::wait()  ???:0
 5 0x0000000000553de8 arb::simulation_state::run()  :0
 6 0x0000000000412c48 main()  ???:0
 7 0x0000000000027300 __libc_start_call_main()  ???:0
 8 0x00000000000273d8 __libc_start_main_alias_2()  :0
 9 0x00000000004179b0 _start()  ???:0
=================================
[jpbot-001-03:277735] *** Process received signal ***
[jpbot-001-03:277735] Signal: Segmentation fault (11)
[jpbot-001-03:277735] Signal code:  (-6)
[jpbot-001-03:277735] Failing at address: 0x191c00043ce7
[jpbot-001-03:277735] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffadf007f0]
[jpbot-001-03:277735] [ 1] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x4be968]
[jpbot-001-03:277735] [ 2] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x4dbbb0]
[jpbot-001-03:277735] [ 3] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x4fabb0]
[jpbot-001-03:277735] [ 4] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x460bf0]
[jpbot-001-03:277735] [ 5] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x4595b4]
[jpbot-001-03:277735] [ 6] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x553de8]
[jpbot-001-03:277735] [ 7] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x412c48]
[jpbot-001-03:277735] [ 8] /lib64/libc.so.6(+0x27300)[0xffffad857300]
[jpbot-001-03:277735] [ 9] /lib64/libc.so.6(__libc_start_main+0x98)[0xffffad8573d8]
[jpbot-001-03:277735] [10] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x4179b0]
[jpbot-001-03:277735] *** End of error message ***
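
The raw frame addresses in the first (MPI-looking) crash can be compared against the symbolized backtrace above. The sketch below does that lookup, assuming both runs used the same `busyring` build (the matching addresses suggest this, but the report does not state it). The table `SYMBOLS` and the function `resolve` are hypothetical names; the addresses and symbols are copied from the two logs, with the long frame-3 template name shortened.

```python
import re

# Address -> symbol, copied from the symbolized backtrace (tid 277735).
# The frame-3 std::_Function_handler<...> name is shortened for readability.
SYMBOLS = {
    0x4BE968: "arb::default_catalogue::kernel_nernst::compute_currents()",
    0x4DBBB0: "arb::fvm_lowered_cell_impl<arb::multicore::backend>::integrate()",
    0x4FABB0: "arb::cable_cell_group::advance()",
    0x460BF0: "std::_Function_handler<...>::_M_invoke()",
    0x4595B4: "arb::threading::task_group::wait()",
}

# Raw frames [1]..[5] copied from the first crash (pid 559978).
RAW = """\
[jpbot-001-20:559978] [ 1] bin/busyring[0x4be968]
[jpbot-001-20:559978] [ 2] bin/busyring[0x4dbbb0]
[jpbot-001-20:559978] [ 3] bin/busyring[0x4fabb0]
[jpbot-001-20:559978] [ 4] bin/busyring[0x460bf0]
[jpbot-001-20:559978] [ 5] bin/busyring[0x467d44]
"""

# Matches "[ N] <path>busyring[0xADDR]" and captures N and ADDR.
FRAME = re.compile(r"\[\s*(\d+)\]\s+\S*busyring\[0x([0-9a-f]+)\]")


def resolve(raw_log: str, symbols: dict[int, str]) -> list[tuple[int, int, str]]:
    """Return (frame index, address, symbol or '<unresolved>') per raw frame."""
    frames = []
    for line in raw_log.splitlines():
        match = FRAME.search(line)
        if match:
            addr = int(match.group(2), 16)
            frames.append((int(match.group(1)), addr, symbols.get(addr, "<unresolved>")))
    return frames


if __name__ == "__main__":
    for index, addr, name in resolve(RAW, SYMBOLS):
        print(f"[{index}] {addr:#x} {name}")
```

With the data above, frames 1 to 4 of the first crash resolve to the same functions as frames 0 to 3 of the symbolized trace, starting at `kernel_nernst::compute_currents()`. Frame 5 (`0x467d44`) is not in the table; it differs from the `task_group::wait()` frame seen in the second trace.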