Closed adammoody closed 5 years ago
I'll test that this at least runs on SLURM and LSF w/o node failures this week. Testing with node failures will have to wait until later when someone who can power off nodes can help.
Thanks Adam! I am wondering if it would be cleaner to install pdsh separately through auto-install.py? It seems a bit hacky to me to have a cmake project calling automake :)
@bnicolae , we could do that. It's in there now because I just copied this over from our SCR cmake which fetches and installs pdsh if needed. Later in the week, I'll make sure it works if pdsh has been installed externally. Then we could have the auto-install script do the work.
It'd also be good to have some support in the case that someone cannot install pdsh for some reason, but that will take more time.
@bnicolae , I moved the pdsh install to the auto-install script. Fixed some other bugs I found while testing. Right now, pdsh is required. A second problem is that cmake is currently finding my system install of pdsh, so I haven't been able to test the side-installed one that the auto-install script built.
I realize that pdsh will not support the same remote command options on all systems, so we'll need to customize this. Currently, it will try mrsh, rsh, and ssh in that order, but many systems likely don't have all of them installed, and even where they are installed, users may not be allowed to run them against compute nodes. We'll have to figure out a clean way to deal with this.
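To make the fallback order concrete, here is a rough sketch of the selection logic in Python. It is not the code in this PR: the function name `pick_rcmd` is hypothetical, and it assumes that finding the client binary on `PATH` is a good enough proxy for "usable", which does not check whether users may actually run it against compute nodes.

```python
import shutil


def pick_rcmd(candidates=("mrsh", "rsh", "ssh"), which=shutil.which):
    """Return the first remote command found on PATH, or None.

    Hypothetical helper: `candidates` mirrors the order described above,
    and `which` is injectable so the lookup can be faked in tests.
    """
    for name in candidates:
        # shutil.which returns the full path if the executable exists,
        # otherwise None.
        if which(name) is not None:
            return name
    return None
```

The result could then be handed to pdsh, for example through its `-R` option, with a site-specific override taking precedence when one is configured.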
I could squash all of this into one commit once we're happy with it, or merge it with GitHub's squash.
@adammoody are you familiar with Kanif (http://taktuk.gforge.inria.fr/kanif/)? I think it will fix many of the issues you mentioned with pdsh.
Adds scripting for restart-in-place for SLURM and LSF resource managers. Detects down nodes and computes remaining healthy set of nodes. Constructs new launch command excluding down nodes.
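As an illustration of the three steps in the description, a minimal Python sketch for the SLURM case follows. It is an assumption-laden outline, not the scripts added here: the function names are hypothetical, the list of down nodes is taken as already detected, and only `srun` with its `-N` and `--exclude` options is shown.

```python
def healthy_nodes(allocated, down):
    """Return the allocated nodes that are not down, preserving order."""
    down_set = set(down)
    return [node for node in allocated if node not in down_set]


def build_srun_command(allocated, down, app_args):
    """Build an srun argument list that avoids the down nodes.

    Hypothetical helper: `allocated` and `down` are lists of host names,
    and `app_args` is the application command line as a list.
    """
    healthy = healthy_nodes(allocated, down)
    if not healthy:
        raise RuntimeError("no healthy nodes left in the allocation")

    # Only exclude nodes that are actually part of this allocation.
    down_set = set(down)
    excluded = [node for node in allocated if node in down_set]

    cmd = ["srun", "-N", str(len(healthy))]
    if excluded:
        cmd.append("--exclude=" + ",".join(excluded))
    return cmd + list(app_args)
```

For example, `build_srun_command(["n1", "n2", "n3"], ["n2"], ["./app"])` yields `["srun", "-N", "2", "--exclude=n2", "./app"]`. An LSF variant would need its own launcher and options.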