LSSTDESC / gen3_workflow

Development code for a Gen3-based DRP pipeline implementation
BSD 3-Clause "New" or "Revised" License

Consider support for Slurm inside singularity and shifter containers #10

Open heather999 opened 3 years ago

heather999 commented 3 years ago

Summary

At NERSC, testing of the gen3_workflow is confined to CVMFS due to the lack of Slurm support within our Shifter images. There is also an ongoing effort at Cambridge to run the workflow using Singularity images rather than being forced to set up the software locally. The ultimate goal is to allow the parsl workflow to use our images and to allow the use of Slurm for batch submission. Ben and Tom have achieved something similar with their gen2 Run2.2i DR2/Run3.1i DR3 parsl workflow, but there the use of Shifter is not known to the workflow or even to Slurm; rather, the parsl workers happen to run their commands inside a Shifter container.

There are SPANK plugins to Slurm for both Shifter and Singularity. Using them requires setting up sbatch scripts appropriately to reference specific images. I could imagine a modified SlurmProvider that adds the SBATCH directives needed to submit a job utilizing a container, such as: #SBATCH --image=docker:yourRepo/yourImage:latest. Maybe that is enough to start.
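
As a rough, untested illustration, the extra directive could probably be passed through parsl's existing scheduler_options hook rather than requiring a new provider; the image name, queue, walltime, and setup script below are all placeholders:

# Hypothetical sketch: request a Shifter image via the SPANK directive using
# parsl's pass-through scheduler_options. All site-specific values are placeholders.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="shifter_htex",
            provider=SlurmProvider(
                partition="debug",        # placeholder queue
                nodes_per_block=1,
                walltime="01:00:00",
                # Extra #SBATCH lines are inserted verbatim into the batch script,
                # so the SPANK plugin option needs no parsl code change:
                scheduler_options="#SBATCH --image=docker:yourRepo/yourImage:latest",
                # Placeholder environment setup run before the worker pool starts:
                worker_init="source setup_lsst_env.sh",
                launcher=SrunLauncher(),
            ),
        )
    ]
)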

Alternatively, we can seek to install Slurm inside our images, which would allow the submit-side code, where the workflow Python script runs, to run inside the container. It might be nice to have the same environment for both the submit side and the parsl-executed tasks. It would be worthwhile to talk this through more with the Parsl developers, especially Ben Clifford, to see how beneficial this might be. One suggested example is enabling the workflow itself to interact with the Butler to determine which parsl tasks are started and how they are configured.
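
For illustration only, the kind of submit-side Butler query I have in mind might look like the following; the repo path and collection are placeholders, and the exact registry API depends on the stack version:

# Hypothetical example: query the Gen3 Butler registry on the submit side to
# decide which parsl tasks to create. Paths and collections are placeholders.
from lsst.daf.butler import Butler

butler = Butler("/path/to/repo", collections=["LSSTCam-imSim/defaults"])
# List the raw exposures the repo knows about; the workflow script could use
# this to decide how many per-exposure tasks to launch and how to configure them.
raw_refs = list(butler.registry.queryDatasets("raw"))
print(f"found {len(raw_refs)} raw datasets")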

Shifter

I have spent some time looking at installing Slurm inside Shifter and reached out to NERSC specifically. The Shifter developers addressed this question directly in their documentation, noting that submitting jobs from within containers is not enabled. NERSC also has some dedicated Parsl documentation where they walk through examples, without Shifter.

After chatting with Brian Van Klaveren, I created a new LSST Science Pipelines docker image based on opensuse/leap:15.2, which seems closest to the OS at NERSC. The source build of the LSST Science Pipelines was successful. The next step would be to install Slurm using the same version and general configuration as at NERSC.

I then stumbled upon this doc concerning installing NERSC libraries within a Shifter image. I attempted to use their script to gather NERSC's srun environment and, unfortunately, ran into the problem that scanelf (used in their shifterize.sh script) is not available. My attempts to install pax-utils (which includes scanelf) locally using zypper have failed. So I'm not sure this avenue will work, and given NERSC's reluctance concerning installing Slurm inside containers, I'm hesitant to reach out to NERSC support. I have now located the scanelf source code and will try building that.

Going back to installing Slurm into the opensuse/leap:15.2 image results in some errors: Failed to connect to bus: No such file or directory. I found an issue that seems related here, where they worked around it by using ssh to submit their Singularity jobs. Just a note that in the case of Shifter images, sshd is disabled unless you turn on the --ccm flag. More work is necessary to get the image set up appropriately.

A Shifter developer also responded to my NERSC ticket:

My understanding is that the submission clients and configuration have to be closely matched to the server. So if Slurm were just installed with standard RPMs, for example, there could be protocol mismatches. I've toyed with an idea of how to work around this. It would involve having some light-weight daemon running outside the container listening to a socket file inside the container. This would allow requests to cross the boundary. So the Slurm clients would still be provided by the system. This approach could even be extended to allow running containers in containers, which is not possible with Shifter at the moment. The only reason I haven't pursued this yet is just time. If there is a strong interest, I could revisit this and try to carve out some cycles.

If we wish to pursue this, I think it would be helpful to involve the Parsl team, as they have already interacted with NERSC to develop the documentation linked above, and to see whether the Shifter developers can be persuaded this is worth their effort.

I think that, for the short term, we should update the SlurmProvider configuration to use the Shifter images and look into getting some help from the Shifter developers on this front.

Singularity

I'm a little more optimistic about this path, but I have not pursued it myself. I found some discussion in a web search.

It would be interesting to hear whether James Perry and the folks at Cambridge have thoughts on this. We could create a Docker image (or a Singularity image directly) based on an OS closer to what is running at Cambridge. What OS would be recommended?

To Do

benclifford commented 3 years ago

Here are a few notes related to parsl / the main issue text at the time of writing:

  1. the current parsl implementation supports adding arbitrary SBATCH options to submitted slurm jobs, without needing any parsl code change. This is done in the gen2 workflow already to set node type and queue/qos: https://github.com/LSSTDESC/ImageProcessingPipelines/blob/13bd540a8578ce22bdd23d0103047ae748bf24b0/workflows/parsl/dr2Cfg.py#L63

  2. parsl runs some of its own code inside the slurm job (the parsl process worker pool). Because of that, the environment directly inside the slurm job needs to have the same parsl version as used in the submitting side running the main workflow body. At present, that "same environment" is a conda environment initialised on the submit side before running, and inherited (I think) inside the slurm job by slurm magic.

  3. Moving to running a container directly inside the slurm job creates some tension with point 2, if the submitting side is not also running from the same image (one possible launcher-based arrangement is sketched after this list).

  4. I like the idea of not having two separate universes: the "has parsl" universe (currently conda) and the "has lsst tooling" universe (currently an image). In the gen2 workflow, there are interactions between lsst stuff and the submit-side workflow to do with discovering what work needs running, but that is constrained to calling out to executables and sharing on-disk files.

  5. the part of the parsl code that submits to slurm is pretty cleanly separated and hackable, so if python-expressible weird hoops have to be jumped through inside a submit-side container to submit, then I am not particularly scared.
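
Following up on points 2 and 3, one possible arrangement (untested, and assuming the installed parsl provides WrappedLauncher) is to start the process worker pool itself inside the Shifter image, so the parsl the workers see is the one baked into the image; the image name and queue are placeholders, and the submit side would still need a matching parsl:

# Untested sketch: run the parsl process worker pool inside the Shifter image
# by prefixing the launch command with "shifter". Image name is a placeholder.
from parsl.launchers import WrappedLauncher
from parsl.providers import SlurmProvider

provider = SlurmProvider(
    partition="debug",  # placeholder queue
    nodes_per_block=1,
    # Request the image via the SPANK plugin, as discussed in point 1:
    scheduler_options="#SBATCH --image=docker:yourRepo/yourImage:latest",
    # Wrap the worker-pool command so it executes inside the container; the
    # parsl installed in the image must match the submit-side parsl (point 2).
    launcher=WrappedLauncher(prepend="shifter"),
)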

brianv0 commented 3 years ago

I did quite a bit of poking around and trying to hack a solution, but I ultimately failed due to either a DNS issue in configuring slurm or some other issue with the slurm configuration that is masked as a DNS issue.

tl;dr: It's not as simple as getting /etc/slurm and /opt/esslurm (which is where slurm is actually installed on a login node) into a container, even though the /opt/esslurm binaries have very few dynamic libraries linked in (really only liblz4).

The first thing I did was investigate /etc/slurm and what was linked in the slurm commands, after I found them in /opt/esslurm:

bvan@cori11:/opt/esslurm> ldd bin/srun 
    linux-vdso.so.1 (0x00002aaaaaad3000)
    libz.so.1 => /lib64/libz.so.1 (0x00002aaaaacd3000)
    liblz4.so.1 => /usr/lib64/liblz4.so.1 (0x00002aaaaaeea000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab0ff000)
    libslurmfull.so => /opt/esslurm/lib64/slurm/libslurmfull.so (0x00002aaaab303000)
    libresolv.so.2 => /lib64/libresolv.so.2 (0x00002aaaab70a000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaab921000)
    libc.so.6 => /lib64/libc.so.6 (0x00002aaaabb40000)
    /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
bvan@cori11:/opt/esslurm> ldd bin/squeue 
    linux-vdso.so.1 (0x00002aaaaaad3000)
    libslurmfull.so => /opt/esslurm/lib64/slurm/libslurmfull.so (0x00002aaaaacd3000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab0da000)
    libresolv.so.2 => /lib64/libresolv.so.2 (0x00002aaaab2de000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaab4f5000)
    libc.so.6 => /lib64/libc.so.6 (0x00002aaaab714000)
    /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
bvan@cori11:/opt/esslurm> ldd bin/sacct
    linux-vdso.so.1 (0x00002aaaaaad3000)
    libslurmfull.so => /opt/esslurm/lib64/slurm/libslurmfull.so (0x00002aaaaacd3000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab0da000)
    libresolv.so.2 => /lib64/libresolv.so.2 (0x00002aaaab2de000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaab4f5000)
    libc.so.6 => /lib64/libc.so.6 (0x00002aaaab714000)
    /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)

From there I tried getting them into a shifter or udocker container in various ways, and I was able to get them into the appropriate places after a bit of hacking. Shifter does a lot of monkeying to prevent you from modifying things in /etc, but most of it is trivially defeated (make a directory /etc/slurm in your image, copy the files to disk outside the container, mount that as a volume into the container, and copy the mounted files to /etc/slurm).

From there, it seems like either /etc/slurm is only half the config and I need to get the rest from a server that I can't resolve via DNS for some reason, or it is mostly just the second part (it all comes from a server). I tried a bit with udocker and had similar DNS issues.

The errors look like this:

sh-4.4$ ./sacct
sacct: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
sacct: error: fetch_config: DNS SRV lookup failed
sacct: error: _establish_config_source: failed to fetch config
sacct: fatal: Could not establish a configuration source

I tried running with leap 15 (an opensuse-based image), installed libdns_sd, and still had issues. There may be some kind of missing file somewhere. I did that based on these packages, out of a hunch:

bvan@cori11:~> zypper packages --installed-only | grep resol
i  | nersc                                           | rubygem-resolve-hostname                  | 0.1.0-1                                             | x86_64
v  | nersc                                           | rubygem-resolve-hostname                  | 0.1.0-0                                             | x86_64
i  | sle-15-module-basesystem                        | xerces-j2-xml-resolver                    | 2.11.0-2.39                                         | noarch
bvan@cori11:~> zypper packages --installed-only | grep dns  
i  | sle-15-product-sles_updates                     | libdns1605                                | 9.16.6-12.32.1                                      | x86_64
i+ | sle-15-product-sles_updates                     | libdns_sd                                 | 0.6.32-5.8.1                                        | x86_64
v  | sle-15-module-basesystem_updates                | libdns_sd                                 | 0.6.32-5.5.3                                        | x86_64
v  | sle-15-module-basesystem                        | libdns_sd                                 | 0.6.32-3.7                                          | x86_64

If it is DNS, there may be some amount of glibc/plugin redirection that will be hard to mimic without copying the NERSC environment into the container in more detail; and if we need to talk to a local socket, that probably won't work either, because shifter disallows some directories from being mounted in the container (udocker doesn't care, though).

There are references to /var/run files in /etc/slurm/slurm.conf that don't actually exist on the login nodes, and it seems like the pid files don't exactly match (though maybe they exist on worker nodes? I don't know). Running strace in the container, if possible, might help in understanding the DNS lookup failure, but I would rather not bother.

brianv0 commented 3 years ago

All of that said, it would be possible to run a daemon somewhere at NERSC that just accepts an incoming environment and srun commands from other jobs, sets the environment, and executes the commands. With that, you would inject site-specific srun wrappers into the container at runtime that just forward to the daemon. The daemon should be protected by a secret. This would not require ssh.
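
To make the shape of that idea concrete, a very rough sketch (not an existing service; the socket path and secret handling are placeholders, and it is not hardened) of such a daemon could look like:

# Rough sketch only: a host-side daemon that accepts {"secret", "command", "env"}
# requests on a Unix socket and runs the forwarded slurm command.
import json
import os
import socket
import subprocess

SOCKET_PATH = "/tmp/slurm_proxy.sock"        # placeholder; would be visible inside the container
SECRET = os.environ["SLURM_PROXY_SECRET"]    # shared secret known to both sides

def serve():
    if os.path.exists(SOCKET_PATH):
        os.unlink(SOCKET_PATH)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCKET_PATH)
    server.listen(1)
    while True:
        conn, _ = server.accept()
        with conn:
            request = json.loads(conn.recv(65536).decode())
            if request.get("secret") != SECRET:
                conn.sendall(b'{"error": "bad secret"}')
                continue
            # Run the forwarded command (e.g. ["sbatch", "job.sh"]) with the
            # environment captured inside the container merged over ours.
            proc = subprocess.run(
                request["command"],
                env={**os.environ, **request.get("env", {})},
                capture_output=True,
                text=True,
            )
            conn.sendall(json.dumps({
                "returncode": proc.returncode,
                "stdout": proc.stdout,
                "stderr": proc.stderr,
            }).encode())

if __name__ == "__main__":
    serve()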

benclifford commented 3 years ago

@brianv0's mention of ssh made me realise that parsl already has code to run the various slurm commands over a (persistent) ssh connection, so it could ssh out from the container to the enclosing host if that is useful. It needs to run both sbatch and squeue (the latter so it can see what is happening with the jobs it has previously submitted). I'm not advocating that as a thing to do, though.
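
For reference, that existing mechanism looks roughly like the following (assuming a parsl version that ships SSHChannel; hostname, user, key, and paths are placeholders):

# Sketch of pointing the provider at an SSH channel so sbatch/squeue run on the
# enclosing host rather than inside the container. All values are placeholders.
from parsl.channels import SSHChannel
from parsl.providers import SlurmProvider

provider = SlurmProvider(
    partition="debug",
    channel=SSHChannel(
        hostname="cori.nersc.gov",     # the host that actually has the slurm clients
        username="your_user",
        key_filename="/path/to/ssh_key",
        script_dir="/global/u1/y/your_user/parsl_scripts",
    ),
)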

heather999 commented 3 years ago

Just wanted to send along an update. One of the Shifter developers responded to our request and has created a service that can run outside a Shifter container and accept commands, such as the slurm commands available on Cori. I tried it with the simple example and it does indeed seem to work. It might be interesting to try this with the gen3 workflow to see if it can be used to submit jobs. Presumably, if this goes well, this feature may be added to Shifter directly so we wouldn't have to do the preliminary steps outlined in the README: https://github.com/scanon/container_proxy