ECP-VeloC / VELOC

Very-Low Overhead Checkpointing System
http://veloc.rtfd.io
MIT License

scripts for restart in place #17

Closed adammoody closed 5 years ago

adammoody commented 5 years ago

Adds scripting for restart-in-place under the SLURM and LSF resource managers. The scripts detect down nodes, compute the remaining set of healthy nodes, and construct a new launch command that excludes the down nodes.
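As a rough illustration of the three steps described above (this is not the code in this PR), the following Python sketch filters an allocation against a list of down nodes and builds an `srun` command that excludes them. The function names and node names are hypothetical, and how the down nodes get detected is left out entirely; the only SLURM details assumed are the standard `srun` options `-N` and `--exclude`.

```python
def healthy_nodes(allocated, down):
    """Return the allocated nodes that are not reported down, in original order."""
    down_set = set(down)
    return [node for node in allocated if node not in down_set]


def build_srun_command(allocated, down, app_cmd):
    """Build an srun argument list that relaunches app_cmd on healthy nodes only.

    allocated: node names in the current allocation
    down:      node names detected as failed (detection is out of scope here)
    app_cmd:   the application command as a list of arguments
    """
    up = healthy_nodes(allocated, down)
    if not up:
        raise RuntimeError("no healthy nodes left in the allocation")

    cmd = ["srun", "-N", str(len(up))]

    # Only exclude nodes that actually belong to this allocation.
    down_set = set(down)
    excluded = [node for node in allocated if node in down_set]
    if excluded:
        cmd.append("--exclude=" + ",".join(excluded))

    return cmd + list(app_cmd)


if __name__ == "__main__":
    print(build_srun_command(["node1", "node2", "node3"], ["node2"], ["./app"]))
```

An LSF variant would need a different launcher and exclusion syntax, which this sketch does not attempt.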

adammoody commented 5 years ago

I'll test that this at least runs on SLURM and LSF w/o node failures this week. Testing with node failures will have to wait until later when someone who can power off nodes can help.

bnicolae commented 5 years ago

Thanks Adam! I am wondering whether it would be cleaner to install pdsh separately through auto-install.py. It seems a bit hacky to me to have a cmake project calling automake :)

adammoody commented 5 years ago

@bnicolae, we could do that. It's in there now because I just copied this over from our SCR cmake, which fetches and installs pdsh if needed. Later in the week, I'll make sure it works if pdsh has been installed externally. Then we could have the auto-install script do the work.

It'd also be good to have some support in the case that someone cannot install pdsh for some reason, but that will take more time.

adammoody commented 5 years ago

@bnicolae, I moved the pdsh install to the auto-install script and fixed some other bugs I found while testing. Right now, pdsh is required. Another problem is that cmake is currently finding my system install of pdsh, so I haven't been able to test the side-installed one that the auto-install script built.

I realize that pdsh will not support the same remote command options on all systems, so we'll need to customize this. Currently, it will try mrsh, rsh, and ssh in that order, but many systems likely do not have all of them installed, and those that do may not allow users to run them against compute nodes. We'll have to figure out a clean way to deal with this.
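One possible way to handle the fallback order mentioned above is to probe for each remote command before using it. The Python sketch below is an assumption-laden illustration, not what the scripts in this PR do: the function name `pick_remote_shell` is hypothetical, and the default probe only checks whether a binary with that name is on `PATH` via `shutil.which`, which says nothing about whether pdsh was built with the matching module or whether the site permits its use against compute nodes.

```python
import shutil

# Preference order taken from the comment above: mrsh, then rsh, then ssh.
PREFERENCE = ("mrsh", "rsh", "ssh")


def pick_remote_shell(preference=PREFERENCE, is_available=None):
    """Return the first remote command in `preference` that passes the probe.

    is_available: callable taking a command name and returning True/False.
                  Defaults to a PATH lookup, which is only a rough heuristic.
    Returns None when nothing in the list is usable.
    """
    if is_available is None:
        def is_available(name):
            return shutil.which(name) is not None

    for name in preference:
        if is_available(name):
            return name
    return None


if __name__ == "__main__":
    choice = pick_remote_shell()
    print(choice if choice is not None else "no remote command found")
```

Passing the probe in as a parameter keeps the selection logic testable and leaves room for a site-specific check, such as a configured override.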

adammoody commented 5 years ago

I could squash all of this into one commit once we're happy with it, or merge it with GitHub's squash option.

bnicolae commented 5 years ago

@adammoody are you familiar with Kanif (http://taktuk.gforge.inria.fr/kanif/)? I think it will fix many of the issues you mentioned with pdsh.