dmtcp / dmtcp

DMTCP: Distributed MultiThreaded CheckPointing
http://dmtcp.sourceforge.net/

API for remotely controlling a DMTCP coordinator #174

Open uselessd opened 8 years ago

uselessd commented 8 years ago

I'm aware of the libdmtcp APIs which can be used to set up a dmtcp_event_hook checkpoint handler for asynchronously reacting to DMTCP preemptions, including the ability to retrieve key-value pairs from a coordinator store. Then, of course, a program can define its own checkpoint strategies by manually adding DMTCP points.

However, how would one go about actually remotely controlling the DMTCP coordinator itself, commanding which processes to checkpoint, issuing restarts, rotating checkpoint files and so forth? Is there a formal API at all, or should I simply write a wrapper around invoking it from the command line and marshal the output to whatever data format is relevant?

karya0 commented 8 years ago

Right now, the only way is to use dmtcp_command to query the coordinator. Further, it's not possible to checkpoint only selected processes: a checkpoint always covers the entire computation. If finer granularity is required, one possibility is to run a separate coordinator for each set of processes that needs to be checkpointed as a group.
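For the command-line route, a thin wrapper could look roughly like the sketch below. It assumes dmtcp_command is on the PATH and accepts the long options shown (--coord-host, --coord-port, --status, --list, --checkpoint, --bcheckpoint, --interval, --kill, --quit), and that the coordinator listens on 7779 unless told otherwise. The helper names are made up for illustration, and the parser only assumes that some output lines have a KEY=VALUE shape; check all of this against the output of your installed version.

```python
import subprocess

# Hypothetical action names mapped to dmtcp_command long options
# (assumed; verify with `dmtcp_command --help` on your installation).
ACTION_FLAGS = {
    "status": "--status",
    "list": "--list",
    "checkpoint": "--checkpoint",
    "blocking_checkpoint": "--bcheckpoint",
    "kill": "--kill",
    "quit": "--quit",
}


def build_dmtcp_command(action, host="localhost", port=7779, interval=None):
    """Build the argv list for one dmtcp_command invocation."""
    argv = ["dmtcp_command", "--coord-host", host, "--coord-port", str(port)]
    if action == "interval":
        if interval is None:
            raise ValueError("the 'interval' action needs an interval in seconds")
        argv += ["--interval", str(interval)]
    elif action in ACTION_FLAGS:
        argv.append(ACTION_FLAGS[action])
    else:
        raise ValueError(f"unknown action: {action!r}")
    return argv


def parse_key_values(text):
    """Collect KEY=VALUE lines into a dict; all other lines are ignored."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key and " " not in key:
            result[key] = value.strip()
    return result


def run_dmtcp_command(action, **kwargs):
    """Run dmtcp_command and return whatever KEY=VALUE pairs it printed."""
    argv = build_dmtcp_command(action, **kwargs)
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return parse_key_values(completed.stdout)
```

Keeping the argv construction separate from the subprocess call makes the wrapper testable without a running coordinator; the resulting dict can then be marshalled to JSON or any other format.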

uselessd commented 8 years ago

Hm, I suppose that does reflect the HPC heritage. I will experiment with a synchronization scheme using multiple coordinators, thanks.
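A rough sketch of what I have in mind, assuming one coordinator per process group, each on its own port, and assuming dmtcp_command accepts --coord-host, --coord-port, --checkpoint and --bcheckpoint. The group names, ports and helper functions are hypothetical.

```python
import subprocess


def checkpoint_argv(host, port, blocking=True):
    """argv for checkpointing everything attached to one coordinator."""
    # --bcheckpoint is assumed to block until the checkpoint completes.
    flag = "--bcheckpoint" if blocking else "--checkpoint"
    return ["dmtcp_command", "--coord-host", host, "--coord-port", str(port), flag]


def plan_checkpoints(coordinators, groups):
    """Map each requested group name to the command that checkpoints it."""
    unknown = [g for g in groups if g not in coordinators]
    if unknown:
        raise KeyError(f"no coordinator registered for: {unknown}")
    return {g: checkpoint_argv(*coordinators[g]) for g in groups}


def checkpoint_groups(coordinators, groups):
    """Checkpoint the selected groups one after another."""
    for group, argv in plan_checkpoints(coordinators, groups).items():
        # With blocking checkpoints, each group finishes before the next starts.
        subprocess.run(argv, check=True)


# Hypothetical layout: one coordinator per service group.
COORDINATORS = {
    "web": ("localhost", 7801),
    "db": ("localhost", 7802),
}
```

Calling checkpoint_groups(COORDINATORS, ["db"]) would then checkpoint only the processes attached to the "db" coordinator and leave the "web" group untouched.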

karya0 commented 8 years ago

Out of curiosity, what is the use case for checkpointing only a select number of processes out of the entire computation?

uselessd commented 8 years ago

Well, it's really about what counts as the unit of work (per-process vs. per-computation). Per-process checkpointing could be useful for live upgrading and hot restarting system services, for instance.