Open zhao-cy opened 1 year ago
The simplest way to fake an HPC Slurm cluster (IMO the one to use for an initial implementation) is to run a single container that runs all the Slurm components. We do this with ReproMan; see https://github.com/ReproNim/reproman-slurm
And our usage of it: https://github.com/ReproNim/reproman/blob/master/tools/ci/setup-slurm-container.sh
The downside of this option is that it may not fully replicate the experience of a multinode slurm cluster.
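As a rough sketch of how a test could talk to such a single-container cluster: submit a one-line job through `docker exec` and read back the job ID. The container name and the helper names here are hypothetical, and this assumes `sbatch` is on the PATH inside the container; it is not how ReproMan's script does it.

```python
import subprocess


def build_sbatch_command(container: str, shell_command: str) -> list:
    """Build a `docker exec` call that submits a one-line job via sbatch.

    `container` is a hypothetical name for the running Slurm container.
    """
    return [
        "docker", "exec", container,
        # --parsable makes sbatch print only the job ID (optionally ";cluster")
        "sbatch", "--parsable", "--wrap", shell_command,
    ]


def parse_job_id(sbatch_output: str) -> int:
    """Extract the numeric job ID from `sbatch --parsable` output."""
    return int(sbatch_output.strip().split(";")[0])


def submit(container: str, shell_command: str) -> int:
    """Submit the job inside the container and return its job ID."""
    result = subprocess.run(
        build_sbatch_command(container, shell_command),
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_job_id(result.stdout)
```

A test would then call something like `submit("slurm", "hostname")` once the container is up; only `submit` needs Docker, so the other two helpers can be checked on their own.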
For a multinode setup, I found this guide very helpful: https://medium.com/analytics-vidhya/slurm-cluster-with-docker-9f242deee601. I went through it, and the cluster worked as expected. The limitation of this multinode setup is that there is no "login node", meaning that jobs need to be submitted from the jupyterlab container.
We could remove the jupyterlab node, and add a container with ssh installed and enabled. IMO this would be our easiest setup to ensure the multinode experience in testing.
For a significantly more advanced setup, the Slurm folks sent me a docker-compose setup that is much more realistic, but much more complex: https://gitlab.com/SchedMD/training/docker-scale-out
I suspect that a docker-compose or podman-compose script (if there are good container images for all of them) is all that we really need to implement simple test clusters, at least for an initial implementation. If this proves to be too fragile for real use, we always have the option to test against a Kubernetes cluster with a not-yet-implemented fake-hpc-clusters operator.
I think all of these options (including option 3) would work on CircleCI, especially on a machine executor. This is very cool.
I suspect it would be better to launch containers that are "the real thing" rather than an emulator, but we might also find some value in looking over this (old) SGE emulator code: https://github.com/chaselgrove/sjs
In the CircleCI tests of BABS, having a "fake HPC test cluster of Slurm" would be ideal - @asmacdo may have some initial effort on this.
If not, having tests regarding application of `babs-init` to Slurm will also be great.
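One way such tests could coexist with environments that have no Slurm at all is to guard them on the presence of the Slurm client tools. A minimal sketch, assuming the tests only need `sbatch` and `squeue` on the PATH; the function name is hypothetical and not part of BABS:

```python
import shutil


def slurm_available(which=shutil.which) -> bool:
    """Return True if the Slurm client tools the tests rely on are on PATH.

    `which` is injectable so the check itself can be tested without Slurm.
    """
    return all(which(tool) is not None for tool in ("sbatch", "squeue"))
```

With pytest this could back a `pytest.mark.skipif(not slurm_available(), reason="Slurm not found")` marker on the Slurm-specific tests, so they run in the fake-cluster CI job and are skipped elsewhere.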