dmtcp / dmtcp

DMTCP: Distributed MultiThreaded CheckPointing
http://dmtcp.sourceforge.net/

Dmtcp Launch with simple mpi example does not work #910

Open Wosch96 opened 3 years ago

Wosch96 commented 3 years ago

Hello guys,

I'm trying to use DMTCP on my VM cluster (CentOS 7) and want to run an example with MPI. I should mention that I'm only running the yum package versions of DMTCP (2.6.1, if I'm not mistaken) and MPI (1.10.7). The cluster contains 4 nodes, connected via ssh. I used a simple hello_world example and can't get it running. I'm running dmtcp_coordinator in a second terminal, and in the first I execute this command:

dmtcp_launch --rm mpirun ./mpiexample

The command fails with the following error:

mpirun noticed that process rank 0 with PID 54000 on node master exited on signal 11 (Segmentation fault).

Did I use DMTCP incorrectly, or is there a problem with my cluster? Could the older versions be the cause? Sorry for asking; I'm new to this kind of checkpointing.

Thank you for any help.

xuyao0127 commented 3 years ago

Hi, support for MPI programs is in another repository: mpickpt/mana. MANA is still in active development and many things are subject to change. The best way to checkpoint MPI programs is to check out the interface7 branch of the MANA repository. Configure with ./configure, then compile with make -j mana. In the bin directory, there are scripts such as mana_coordinator, mana_launch, and mana_restart that act as wrappers for easier use.
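The build steps above can be sketched as a small script. This is only an illustration: the clone URL is an assumption derived from the repository name mpickpt/mana, and the `run` helper and `DRY_RUN` variable are hypothetical names added here so the script prints the commands by default instead of executing them.

```shell
#!/bin/sh
# Sketch of the MANA build steps described in the reply.
# Assumption: the repository is hosted at https://github.com/mpickpt/mana
# Dry-run by default: each command is printed, not executed.
# Invoke with DRY_RUN=0 to actually run the steps.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it
# and stop at the first failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@" || exit 1
    fi
}

run git clone https://github.com/mpickpt/mana.git
run cd mana
run git checkout interface7   # branch named in the reply
run ./configure
run make -j mana
run ls bin                    # the wrapper scripts are expected here
```

Since MANA is under active development, the branch name and build targets may have changed; the repository's own README is the place to check before relying on this.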

mana_coordinator is the same as dmtcp_coordinator, but runs in daemon mode and suppresses all output. You can still use dmtcp_coordinator if you prefer. mana_launch is similar to dmtcp_launch, but with the MPI plugin enabled. To use it, run mpirun mana_launch [your program]. mana_restart is the wrapper for dmtcp_restart with the MPI plugin enabled. All of these wrapper scripts accept the same options as their DMTCP counterparts if needed.
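A rough sketch of how these wrappers might fit together is below. Note the ordering: mpirun wraps mana_launch, whereas the command in the original post had dmtcp_launch wrapping mpirun. Several things here are assumptions and not stated in the reply: the `MANA_BIN` path, the process count of 4 (taken from the 4-node cluster), and the form of the restart command, which is written by analogy with the launch command. The `run` helper and `DRY_RUN` variable are hypothetical and make the script print commands by default.

```shell
#!/bin/sh
# Sketch of a launch/restart workflow with the MANA wrapper scripts.
# Dry-run by default; invoke with DRY_RUN=0 to actually execute.
DRY_RUN="${DRY_RUN:-1}"
MANA_BIN="${MANA_BIN:-./mana/bin}"   # assumed location of the built scripts
NP="${NP:-4}"                        # assumed number of MPI processes
PROG="${PROG:-./mpiexample}"         # program from the original post

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@" || exit 1
    fi
}

launch_job() {
    # Start the coordinator (daemon mode), then launch under mpirun.
    run "$MANA_BIN/mana_coordinator"
    run mpirun -np "$NP" "$MANA_BIN/mana_launch" "$PROG"
}

restart_job() {
    # Assumption: restart mirrors launch, with mpirun wrapping mana_restart.
    run "$MANA_BIN/mana_coordinator"
    run mpirun -np "$NP" "$MANA_BIN/mana_restart"
}

# Usage: script.sh [launch|restart]; defaults to launch.
case "${1:-launch}" in
    launch)  launch_job ;;
    restart) restart_job ;;
    *)       echo "usage: $0 [launch|restart]" >&2; exit 2 ;;
esac
```

While the job is running, a checkpoint would be requested from another terminal, for example with dmtcp_command --checkpoint; that this works unchanged against mana_coordinator is an assumption based on the statement that it is the same as dmtcp_coordinator.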