NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Running srun, sbatch, salloc from within pyxis #31

Open itzsimpl opened 3 years ago

itzsimpl commented 3 years ago

I would like to use Enroot containers to provide toolchain environments for Slurm, i.e. as a sort of substitute for Lmod modules. A typical example is NVIDIA's container images, which can contain source code whose workflows run in multiple steps. My question is: is it possible to generate Slurm jobs from within a pyxis/enroot container?

flx42 commented 3 years ago

It might be possible, but honestly I haven't tried.

You will likely need to have the same Slurm version inside the container as on the cluster (or bind-mount the binaries/libraries from the host). You want to run as non-remapped root, and you might need to bind-mount some more files from the host (I don't think slurmd uses a UNIX domain socket, so at least it should be fine on that side).

If it fails initially, using strace might help to discover which files srun/sbatch/salloc are trying to open inside the container environment.
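
For instance, something along these lines run inside the container can show which lookups fail (generic strace usage, nothing pyxis-specific):

# Log file accesses, then grep for failed lookups to spot missing bind mounts.
strace -f -e trace=open,openat -o /tmp/srun.trace srun hostname
grep ENOENT /tmp/srun.trace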

itzsimpl commented 3 years ago

I did a quick test, but at the moment it seems infeasible, as one needs to bind-mount too many things: the Slurm binaries and configuration, the plugin directory, libmunge and the munge socket, and so on. With enough of these mounted I was able to at least run sinfo, but I stopped there.
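
To give a rough idea, the invocation ends up looking something like this (a sketch only; the image and paths are illustrative and cluster-specific):

srun --container-image=ubuntu:22.04 \
     --container-mounts=/etc/slurm:/etc/slurm,/usr/bin/sinfo:/usr/bin/sinfo,/usr/bin/srun:/usr/bin/srun,/usr/lib/x86_64-linux-gnu/slurm:/usr/lib/x86_64-linux-gnu/slurm,/usr/lib/x86_64-linux-gnu/libmunge.so.2:/usr/lib/x86_64-linux-gnu/libmunge.so.2,/run/munge:/run/munge \
     sinfo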

Is there any plan to add official support?

One related question: is there a plan to support passing the enroot container to sbatch?

flx42 commented 3 years ago

Is there any plan to add official support?

No, not right now, sorry. Most of the work can be done in the container image (e.g. by installing the same stack / scripts inside the container image). You could also write a custom enroot hook to mount everything that is needed; I don't think this should be done by pyxis.

One related question: is there a plan to support passing the enroot container to sbatch?

This has been requested a few times, so we are considering it. I can't tell you for sure if it will happen, or when.

Thanks.

itzsimpl commented 3 years ago

Could you clarify what you mean by "e.g. by installing the same stack / scripts inside the container image"?

Support for sbatch would be really awesome, as it is the only command that supports the --array parameter, and many existing scripts use it.

One highly specific example (which I am playing with, just to give some perspective) is the Kaldi toolkit (https://github.com/kaldi-asr/kaldi). Sure, one can run it from inside the container image (https://ngc.nvidia.com/catalog/containers/nvidia:kaldi), started with a single srun command that requests resources (CPUs+GPUs) for the entire duration of the run.

I would say this is not good practice, as during training only CPUs are in use for about half of the time. Most of the scripts, however, have already been written to support GridEngine/Slurm; they generate srun or sbatch commands. So, to take full advantage of the cluster and not hog resources when they are not needed, one would need to either be able to run srun from inside an Enroot container (to place a subtask into the queue) or be able to pass --container-image to sbatch and run the top-level script from the shell.

flx42 commented 3 years ago

Could you clarify what you mean by "e.g. by installing the same stack / scripts inside the container image"?

I mean that you could craft a custom container image with the same Slurm libraries, binaries and configuration as the ones installed on your cluster. I guess your Slurm version doesn't change often, so it might be fine.
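
For example (just a sketch; the exact packages and paths depend on your distribution and on how Slurm was built):

# On the cluster, note the exact Slurm version the image has to match:
sinfo --version   # e.g. prints "slurm 21.08.8"
# When building the image, install or build that same Slurm version inside it, and
# copy in the cluster's /etc/slurm/slurm.conf so the client commands can reach the controller.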

So, to take full advantage of the cluster and not hog resources when they are not needed, one would need to either be able to run srun from inside an Enroot container (to place a subtask into the queue) or be able to pass --container-image to sbatch and run the top-level script from the shell.

I see; we have similar use cases, but we took a different approach: the sbatch script uses srun --container-image to run the containerized task, and if it needs to schedule a follow-up job, it does so after this job has completed, for instance with sbatch --dependency=afterok:${SLURM_JOB_ID} next_task.sh.
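
A minimal sketch of that pattern (the image tag, script names and resource requests here are placeholders):

#!/bin/bash
#SBATCH --job-name=gpu-stage
#SBATCH --gres=gpu:8

# Run the containerized task for this stage through pyxis.
srun --container-image=nvcr.io#nvidia/kaldi:<tag> ./run_gpu_stage.sh

# Queue the follow-up (e.g. CPU-only) stage once this job has completed successfully.
sbatch --dependency=afterok:${SLURM_JOB_ID} next_task.sh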

3XX0 commented 3 years ago

FWIW, the following is an enroot config that should do the job (I used it in the past); you can convert it to enroot system configuration files and have Slurm injected automatically into all your containers.

readonly srun_cmd=$(command -v srun)
readonly slurm_conf="/etc/slurm/slurm.conf"
readonly slurm_plugin_dir=$(scontrol show config | awk '/PluginDir/{print $3}')
readonly slurm_plugstack_dir="/etc/slurm/plugstack.conf.d"
readonly slurm_user=$(scontrol show config | awk '/SlurmUser/{print $3}')
readonly libpmix_path=$(ldconfig -p | awk '/libpmix/{print $4; exit}')
readonly libhwloc_path=$(ldconfig -p | awk '/libhwloc/{print $4; exit}')
readonly libmunge_path=$(ldconfig -p | awk '/libmunge/{print $4; exit}')
readonly munge_sock_path=$(awk -F= '/AccountingStoragePass/{print $2}' "${slurm_conf}")

mounts() {
   echo "${srun_cmd} ${srun_cmd}"
   echo "${slurm_conf%/*} ${slurm_conf%/*}"
   echo "${slurm_plugin_dir} ${slurm_plugin_dir}"
   awk '{print $2" "$2}' "${slurm_plugstack_dir}"/*
   echo "${libpmix_path} ${libpmix_path%.*}"
   echo "${libhwloc_path} ${libhwloc_path}"
   echo "${libmunge_path} ${libmunge_path}"
   echo "${munge_sock_path} ${munge_sock_path}"
}

environ() {
   echo "LD_LIBRARY_PATH=${libmunge_path%/*}:${libpmix_path%/*}:${libhwloc_path%/*}"
   env | grep SLURM || :
}

hooks() {
   getent passwd "${slurm_user%(*}" >> ${ENROOT_ROOTFS}/etc/passwd
}
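
With a standard enroot install (ENROOT_SYSCONF_PATH=/etc/enroot), converting this to system configuration roughly means hard-coding the resolved paths for your cluster into an fstab file under /etc/enroot/mounts.d/, putting the environment into /etc/enroot/environ.d/, and installing the passwd snippet as an executable script under /etc/enroot/hooks.d/. For illustration only (paths will differ per cluster):

# /etc/enroot/mounts.d/50-slurm.fstab
/etc/slurm /etc/slurm none x-create=dir,bind,ro,nosuid
/usr/bin/srun /usr/bin/srun none x-create=file,bind,ro,nosuid
/run/munge /run/munge none x-create=dir,bind,ro,nosuid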

dr-br commented 2 years ago

We are very happy with #55, but now users want to run multi-node jobs with sbatch. @3XX0, could you please comment on https://github.com/NVIDIA/pyxis/issues/31#issuecomment-718928890 and explain how to set that up? Thanks!

flx42 commented 2 years ago

I don't think this is something we can support reliably unless we get https://bugs.schedmd.com/show_bug.cgi?id=12230, OR some kind of API compatibility guarantee, OR you build your containers with the same version of Slurm as the one installed on the cluster (i.e. non-portable containers).

tf-nv commented 3 months ago

I do have a use case for srun inside a pyxis container as well. There is a framework that dynamically builds an srun command and launches it. The framework has intricate dependencies and has to be launched from within a container itself. However, srun is not available inside that container, so the dynamically built srun command cannot be launched.