cw646 opened this issue 3 years ago
I noticed that when load balancing is enabled, we get the following error at the end of the mpm iteration (the backtraces below are regrouped by thread id; the raw output was interleaved):
```
Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa790bbc038)
==== backtrace (tid: 695966) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7f6f69703524]
 1 /lib64/libucs.so.0(+0x290cd) [0x7f6f697060cd]
 2 /lib64/libucs.so.0(+0x292aa) [0x7f6f697062aa]
 3 /lib64/libpthread.so.0(+0x141e0) [0x7f700e5331e0]
 4 /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7f700e67e313]
 5 /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7f700e68167c]
 6 ./mpm() [0x414a74]
 7 /lib64/libc.so.6(__libc_start_main+0xf2) [0x7f6ffea741e2]
 8 ./mpm() [0x417b1e]
=================================
==== backtrace (tid: 695965) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fb40ebfa524]
 1 /lib64/libucs.so.0(+0x290cd) [0x7fb40ebfd0cd]
 2 /lib64/libucs.so.0(+0x292aa) [0x7fb40ebfd2aa]
 3 /lib64/libpthread.so.0(+0x141e0) [0x7fb4b3a2a1e0]
 4 /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7fb4b3b75313]
 5 /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7fb4b3b7867c]
 6 ./mpm() [0x414a74]
 7 /lib64/libc.so.6(__libc_start_main+0xf2) [0x7fb4a3f6b1e2]
 8 ./mpm() [0x417b1e]
=================================
==== backtrace (tid: 695967) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fa808fa2524]
 1 /lib64/libucs.so.0(+0x290cd) [0x7fa808fa50cd]
 2 /lib64/libucs.so.0(+0x292aa) [0x7fa808fa52aa]
 3 /lib64/libpthread.so.0(+0x141e0) [0x7fa8addd21e0]
 4 /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7fa8adf1d313]
 5 /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7fa8adf2067c]
 6 ./mpm() [0x414a74]
 7 /lib64/libc.so.6(__libc_start_main+0xf2) [0x7fa89e3131e2]
 8 ./mpm() [0x417b1e]
=================================
[caee-userk:695964:0:695964] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f0112f42038)
==== backtrace (tid: 695964) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7f018b328524]
 1 /lib64/libucs.so.0(+0x290cd) [0x7f018b32b0cd]
 2 /lib64/libucs.so.0(+0x292aa) [0x7f018b32b2aa]
 3 /lib64/libpthread.so.0(+0x141e0) [0x7f02301581e0]
 4 /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7f02302a3313]
 5 /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7f02302a667c]
 6 ./mpm() [0x414a74]
 7 /lib64/libc.so.6(__libc_start_main+0xf2) [0x7f02206991e2]
 8 ./mpm() [0x417b1e]
=================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 695964 RUNNING AT caee-userk
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 695965 RUNNING AT caee-userk
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 695966 RUNNING AT caee-userk
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 695967 RUNNING AT caee-userk
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
```
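Every rank's stack ends in `PMPI_Buffer_detach`, so the fault occurs while detaching the buffered-send buffer, not in the load-balancing logic itself. A common cause of exactly this segfault is passing the buffer pointer itself as the first argument, when `MPI_Buffer_detach` expects the *address of a pointer variable* into which it writes the attached buffer's address. As a point of comparison (not the actual `mpm` code, which I don't have access to), a correct attach/detach cycle looks roughly like this sketch; the 4096-byte size is illustrative:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Attach a buffer for buffered-mode sends (MPI_Bsend).
       The size must include MPI_BSEND_OVERHEAD per pending message. */
    int attach_size = 4096 + MPI_BSEND_OVERHEAD;
    char *send_buf = malloc(attach_size);
    MPI_Buffer_attach(send_buf, attach_size);

    /* ... MPI_Bsend calls would go here ... */

    /* Detach before freeing or re-attaching. Note the first argument
       is the ADDRESS of a pointer: MPI writes the attached buffer's
       address back into `detached`. Passing `send_buf` directly here,
       or detaching a buffer that was never attached, can segfault. */
    char *detached = NULL;
    int detached_size = 0;
    MPI_Buffer_detach(&detached, &detached_size);
    free(detached);   /* detached == send_buf at this point */

    MPI_Finalize();
    return 0;
}
```

It may also be worth checking whether the load-balancing path can call detach twice, or detach on a rank that skipped the attach.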
This may be the cause or a side effect. However, the result looks good in 2D and doesn't crash:
Step 1:
Step 2:
**Describe the bug**
When using multiple MPI nodes, the load-balancing step causes a crash.

**To Reproduce**
Steps to reproduce the behavior:

**Expected behavior**
The load-balancing step should not cause the simulation to halt.

**Runtime environment (please complete the following information):**