PennLINC / babs

BIDS App Bootstrap (BABS)
https://pennlinc-babs.readthedocs.io
MIT License

[slurm] set up CircleCI tests for application on slurm clusters #35

Open zhao-cy opened 1 year ago

zhao-cy commented 1 year ago

For the CircleCI tests of BABS, a "fake" Slurm HPC test cluster would be ideal; @asmacdo may already have done some initial work on this.

If not, tests that cover running babs-init against Slurm would also be great.
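As a rough illustration of how such tests could be gated, here is a minimal sketch that skips Slurm-dependent tests when the Slurm client commands are not on `PATH`. It uses only the Python standard library. The helper names (`missing_slurm_commands`, `requires_slurm`) and the test class are hypothetical and not part of BABS. The thread does not say which test runner BABS uses, so `unittest` is an assumption here.

```python
import shutil
import unittest

# Slurm client commands a test would typically need on the machine running it.
SLURM_COMMANDS = ("sbatch", "squeue", "sinfo")


def missing_slurm_commands(which=shutil.which):
    """Return the Slurm client commands that cannot be found on PATH."""
    return [cmd for cmd in SLURM_COMMANDS if which(cmd) is None]


def requires_slurm(test_item):
    """Decorator (hypothetical helper): skip a test or test class without Slurm."""
    missing = missing_slurm_commands()
    reason = f"Slurm commands not found: {missing}"
    return unittest.skipIf(bool(missing), reason)(test_item)


@requires_slurm
class TestInitOnSlurm(unittest.TestCase):
    def test_slurm_clients_present(self):
        # Placeholder: a real test would invoke babs-init here; its arguments
        # are not given in this thread, so they are deliberately left out.
        self.assertEqual(missing_slurm_commands(), [])
```

With a guard like this, the same test suite could run both on a developer laptop (tests skipped) and inside a CI job that has a fake cluster available (tests executed).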

asmacdo commented 1 year ago

Option 1

The simplest way to fake an HPC Slurm cluster (IMO the one to use for the initial implementation) is to run a single container that runs all the Slurm components. We do this with ReproMan; see https://github.com/ReproNim/reproman-slurm

And our usage of it: https://github.com/ReproNim/reproman/blob/master/tools/ci/setup-slurm-container.sh

The downside of this option is that it may not fully replicate the experience of a multinode slurm cluster.
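A minimal sketch of what driving such a single-container cluster from a test setup step might look like. The image name and container name are hypothetical placeholders, not the values used by the ReproMan script linked above. The readiness check assumes that `sinfo` is available inside the container and exits with status 0 once the Slurm daemons are up.

```python
import subprocess
import time

# Hypothetical placeholders: substitute the image and name used in your setup.
IMAGE = "example/slurm-all-in-one:latest"
CONTAINER = "slurm-test"


def docker_run_command(image=IMAGE, name=CONTAINER):
    """Build the argv that starts the all-in-one container in the background."""
    return ["docker", "run", "--detach", "--rm", "--name", name, image]


def wait_until_ready(probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call probe() up to `attempts` times; return True as soon as it succeeds."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False


def sinfo_responds(name=CONTAINER):
    """True if `sinfo` inside the container exits with status 0 (assumed check)."""
    result = subprocess.run(["docker", "exec", name, "sinfo"], capture_output=True)
    return result.returncode == 0


def start_cluster():
    """Start the container and block until Slurm answers, or raise."""
    subprocess.run(docker_run_command(), check=True)
    if not wait_until_ready(sinfo_responds):
        raise RuntimeError("Slurm did not become ready in time")
```

Polling is used here because the daemons inside the container generally need a few seconds after `docker run` returns before they accept commands.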

Option 2

For a multinode setup, I found this guide very helpful: https://medium.com/analytics-vidhya/slurm-cluster-with-docker-9f242deee601

I went through it, and the cluster worked as expected. The limitation of this multinode setup is that there is no "login node", so jobs have to be submitted from the jupyterlab container.
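One way to check that a multinode cluster like this "works as expected" is to count the nodes Slurm reports as idle. The sketch below parses the output of `sinfo -h -o "%D %T"` (node count and state, no header). It would have to run wherever the Slurm client tools are installed, which in this setup means inside the jupyterlab container. The function names are hypothetical.

```python
import subprocess


def count_nodes_by_state(sinfo_output):
    """Sum node counts per state from `sinfo -h -o "%D %T"` output.

    Caveats: nodes that belong to several partitions are listed once per
    partition, and states carrying a flag suffix (e.g. "idle*") are counted
    under their own key rather than merged with the plain state.
    """
    counts = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or unexpected lines
        number, state = parts
        counts[state] = counts.get(state, 0) + int(number)
    return counts


def idle_node_count(run=subprocess.run):
    """Run sinfo and return how many nodes are reported as idle."""
    result = run(
        ["sinfo", "-h", "-o", "%D %T"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_nodes_by_state(result.stdout).get("idle", 0)
```

A test setup step could then assert that `idle_node_count()` equals the number of compute containers it started before submitting any jobs.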

Option 2 extended

We could remove the jupyterlab node and add a container with ssh installed and enabled to act as the login node. IMO this would be the easiest way to get the multinode experience in testing.
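With an ssh-enabled login container, job submission from a test could be sketched roughly as follows. The host, user, and port values are hypothetical and depend on how the container is configured. The sketch relies on `sbatch --parsable`, which prints the job ID, optionally followed by `;` and the cluster name.

```python
import shlex
import subprocess


def ssh_sbatch_command(host, script_path, user="testuser", port=22):
    """Build the argv that submits a job script through the login container.

    `user` and `port` defaults are hypothetical placeholders.
    """
    # The path is quoted because ssh hands the remote command to a shell.
    return [
        "ssh", "-p", str(port), f"{user}@{host}",
        "sbatch", "--parsable", shlex.quote(script_path),
    ]


def parse_parsable_job_id(stdout):
    """Extract the job ID from `sbatch --parsable` output."""
    first = stdout.strip().split(";")[0]
    if not first.isdigit():
        raise ValueError(f"unexpected sbatch output: {stdout!r}")
    return int(first)


def submit_via_login_node(host, script_path, run=subprocess.run, **ssh_options):
    """Submit a script over ssh and return the Slurm job ID."""
    result = run(
        ssh_sbatch_command(host, script_path, **ssh_options),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_parsable_job_id(result.stdout)
```

Keeping the command builder and the output parser separate from the function that actually calls ssh makes both easy to unit test without a running cluster.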

Option 3

For a significantly more advanced setup, the Slurm folks sent me a docker-compose setup that is much more realistic, but also much more complex: https://gitlab.com/SchedMD/training/docker-scale-out

In general

I suspect that a docker-compose or podman-compose script (if there are good container images for all the components) is all that we really need to implement simple test clusters, at least for the initial implementation. If this proves to be too fragile for real use, we always have the option to test against a kubernetes cluster with a not-yet-implemented fake-hpc-clusters operator.

mattcieslak commented 1 year ago

I think all of these options (including option 3) would work on CircleCI, especially on a machine executor. This is very cool.

asmacdo commented 1 year ago

More prior art:

https://github.com/ExaWorks/containerized-testing-environment

https://github.com/wilke/psij-compose-testing

asmacdo commented 1 year ago

I suspect it would be better to launch containers that are "the real thing" rather than an emulator, but we might also find some value in looking over this (old) SGE emulator code: https://github.com/chaselgrove/sjs