cb-geo / mpm

CB-Geo High-Performance Material Point Method
https://www.cb-geo.com/research/mpm

Load balancing on multiple nodes causing crash #709

Open cw646 opened 3 years ago

cw646 commented 3 years ago

Describe the bug: When using multiple MPI nodes, the load-balancing step causes a crash.

To Reproduce: Steps to reproduce the behavior:

  1. Run with more than one MPI node.
  2. Set the number of load-balancing steps to less than the total number of steps.
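
For illustration, a minimal sketch of an input fragment that would match these steps. The key names `nsteps` and `nload_balance_steps` under `analysis` are assumed from the usual cb-geo/mpm input layout, and the values are hypothetical rather than taken from the failing run:

```json
{
  "analysis": {
    "nsteps": 1000,
    "nload_balance_steps": 100
  }
}
```

With settings like these, load balancing would presumably be triggered before the run finishes, which is the condition described above when more than one MPI process is in use.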

Expected behavior: Load balancing should not cause the simulation to halt.

kks32 commented 3 years ago

I noticed that when load balancing is enabled, we get the following error at the end of the MPM iteration:

 Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa790bbc038)
==== backtrace (tid: 695966) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7f6f69703524]
 1  /lib64/libucs.so.0(+0x290cd) [0x7f6f697060cd]
 2  /lib64/libucs.so.0(+0x292aa) [0x7f6f697062aa]
==== backtrace (tid: 695965) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fb40ebfa524]
 1  /lib64/libucs.so.0(+0x290cd) [0x7fb40ebfd0cd]
 2  /lib64/libucs.so.0(+0x292aa) [0x7fb40ebfd2aa]
 3  /lib64/libpthread.so.0(+0x141e0) [0x7fb4b3a2a1e0]
 4  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7fb4b3b75313]
 5  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7fb4b3b7867c]
 6  ./mpm() [0x414a74]
 7  /lib64/libc.so.6(__libc_start_main+0xf2) [0x7fb4a3f6b1e2]
 8  ./mpm() [0x417b1e]
=================================
 3  /lib64/libpthread.so.0(+0x141e0) [0x7f700e5331e0]
 4  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7f700e67e313]
 5  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7f700e68167c]
 6  ./mpm() [0x414a74]
 7  /lib64/libc.so.6(__libc_start_main+0xf2) [0x7f6ffea741e2]
==== backtrace (tid: 695967) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fa808fa2524]
 1  /lib64/libucs.so.0(+0x290cd) [0x7fa808fa50cd]
 2  /lib64/libucs.so.0(+0x292aa) [0x7fa808fa52aa]
 3  /lib64/libpthread.so.0(+0x141e0) [0x7fa8addd21e0]
 4  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7fa8adf1d313]
 5  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7fa8adf2067c]
 6  ./mpm() [0x414a74]
 7  /lib64/libc.so.6(__libc_start_main+0xf2) [0x7fa89e3131e2]
 8  ./mpm() [0x417b1e]
=================================
 8  ./mpm() [0x417b1e]
=================================
[caee-userk:695964:0:695964] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f0112f42038)
==== backtrace (tid: 695964) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7f018b328524]
 1  /lib64/libucs.so.0(+0x290cd) [0x7f018b32b0cd]
 2  /lib64/libucs.so.0(+0x292aa) [0x7f018b32b2aa]
 3  /lib64/libpthread.so.0(+0x141e0) [0x7f02301581e0]
 4  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(+0x132313) [0x7f02302a3313]
 5  /home/user/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12(PMPI_Buffer_detach+0x8c) [0x7f02302a667c]
 6  ./mpm() [0x414a74]
 7  /lib64/libc.so.6(__libc_start_main+0xf2) [0x7f02206991e2]
 8  ./mpm() [0x417b1e]
=================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 695964 RUNNING AT caee-userk
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 695965 RUNNING AT caee-userk
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 695966 RUNNING AT caee-userk
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 695967 RUNNING AT caee-userk
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

This may be the cause or a side effect. However, the result looks good in 2D and doesn't crash:

Step 1: [image]
Step 2: [image]