I have been trying to checkpoint and restart an MPI application on two nodes (8 processes total). I resolved various errors caused by wrong paths, versions, and flags, but now I am getting a segfault.
Here are the versions that I am currently using:
• MPI: OpenMPI (OpenRTE) 2.1.2
• GCC: 4.8
• DMTCP: 3.0
For the batch submission script (SLURM), I used the start_coordinator function from the sample file.
Then I added the following to the script:
I got a bunch of warnings about the checkpoint signal number and “Datagram Sockets not supported. Hopefully, this is a short-lived connection!”. But the main issue was “dlopen failed!” in:
[46000] WARNING at dlwrappers.cpp:76 in dlopen; REASON='JWARNING(ret) failed'
filename = /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/libfabric/lib/prov/libmlx-fi.so
As DMTCP suggested, I set DMTCP_DL_PLUGIN to 0, which got rid of the above warning. However, the application then segfaulted on the remote ranks.
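For reference, setting that variable in the batch script amounts to something like the following. This is only a sketch: where the line sits relative to the launch command is an assumption, and it does not by itself show whether the value reaches the processes on the second node.

```shell
# Disable DMTCP's dl (dlopen wrapper) plugin, as the DMTCP warning suggests.
# Exported so that processes started later from this script inherit it.
export DMTCP_DL_PLUGIN=0
```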
Do you have any idea what went wrong? I can provide more logs if needed.
The error printed in the output file of the remote job:
mpirun noticed that process rank 7 with PID 51000 on node r139 exited on signal 11 (Segmentation fault).