dmtcp / dmtcp

DMTCP: Distributed MultiThreaded CheckPointing
http://dmtcp.sourceforge.net/

SLURM-DMTCP-Segfault #830

Open staheri opened 4 years ago

staheri commented 4 years ago

I have been trying to checkpoint and restart an MPI application on two nodes (8 processes total). I resolved a series of errors caused by wrong paths, versions, and flags, but now I get a segfault. Here are the versions I am currently using:
• MPI: OpenMPI (OpenRTE) 2.1.2
• GCC: 4.8
• DMTCP: 3.0

For the SLURM batch submission script, I used the start_coordinator function from the sample file. Then I added the following to the script:

module load mpi/gcc_openmpi
export PATH=/home/dmtcp-3.0/bin:$PATH
export LD_LIBRARY_PATH=/home/dmtcp-3.0/lib:$LD_LIBRARY_PATH

ulimit -s 10000
start_coordinator
dmtcp_launch --interval 25 --rm mpirun -np $SLURM_NTASKS ./mpi_dmtcp_hello

I got a number of warnings about the checkpoint signal number and “Datagram Sockets not supported. Hopefully, this is a short-lived connection!”. But the main issue was a “dlopen failed!” warning:

[46000] WARNING at dlwrappers.cpp:76 in dlopen; REASON='JWARNING(ret) failed' filename = /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/libfabric/lib/prov/libmlx-fi.so

As DMTCP suggested, I set DMTCP_DL_PLUGIN to 0, which got rid of the above warning. However, the application then segfaulted on the remote ranks.

Do you have any idea what went wrong? I can provide more logs if needed.

staheri commented 4 years ago

The error printed in the output file of the remote job:

mpirun noticed that process rank 7 with PID 51000 on node r139 exited on signal 11 (Segmentation fault).
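
One detail that may be worth checking: the dlopen warning names a library under an Intel MPI directory (/opt/intel/.../mpi/intel64/libfabric/...), while the versions listed above are for OpenMPI. Whether that mismatch is related to the segfault is not established here. The sketch below is only a diagnostic aid. The function name mpi_families and the path patterns it matches are assumptions made for illustration and are not part of DMTCP, SLURM, or either MPI distribution. It reads library paths on standard input and reports which MPI families appear among them.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DMTCP or any MPI distribution).
# Reads one path or ldd output line per line on stdin and prints the
# MPI families whose typical path fragments appear. The patterns are
# assumptions and may need adjusting for a given cluster.
mpi_families() {
    local line found_intel=0 found_openmpi=0
    while IFS= read -r line; do
        case "$line" in
            */intel/*|*intel64*) found_intel=1 ;;
        esac
        case "$line" in
            *openmpi*|*libopen-rte*|*libopen-pal*) found_openmpi=1 ;;
        esac
    done
    if [ "$found_intel" -eq 1 ]; then echo "intel"; fi
    if [ "$found_openmpi" -eq 1 ]; then echo "openmpi"; fi
    return 0
}

# Possible usage inside the batch script, after "module load":
#   ldd ./mpi_dmtcp_hello | mpi_families
#   echo "$LD_LIBRARY_PATH" | tr ':' '\n' | mpi_families
```

If both "intel" and "openmpi" are printed for the same binary or environment, libraries from two MPI installations may be visible to the job at once, which would be useful to mention when sending further logs.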