Multi node job, only last node runs inside container

kees-closed commented 3 years ago

I'm running the command below and, for some reason, only the last node seems to run inside the container (Ubuntu); the others run on the host itself (RHEL 7). Am I missing a parameter here? I checked the docs, but they only show single-node examples.

[user@tcn1189 ~]$ srun -N 4 -t 20 --slurmd-debug=0 --container-image=/home/user/tmp/enroot/petsc.sqsh --container-name=petsc --container-mounts=/home/user/tmp/enroot/libraries:/mnt,/usr:/host,/etc/slurm:/etc/slurm grep PRETTY /etc/os-release
PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
slurmstepd: error: xcpuinfo_hwloc_topo_load: failed (load will be required after read failures).
slurmstepd: error: xcpuinfo_hwloc_topo_load: failed (load will be required after read failures).
slurmstepd: error: Could not run slurm task_prolog [/nfs/admin/scripts/admin/testcluster/slurm_taskprolog]: No such file or directory
slurmstepd: error: TMPDIR [/scratch-local/user] is not writeable
slurmstepd: error: Setting TMPDIR to /tmp
PRETTY_NAME="Ubuntu 18.04.4 LTS"

flx42 commented 3 years ago

@AquaL1te this doesn't seem right, and I haven't seen this behavior on our clusters. Can you verify that all of these compute nodes have pyxis installed and configured? If that's the case, can you check whether --container-image works on one node at a time?

Thanks!
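
A minimal sketch of the one-node-at-a-time check suggested above. The function names are hypothetical, and the "Red Hat" pattern is an assumption taken from the host output in the original report; it would misclassify a RHEL-based image. Nothing runs until check_each_node is called from inside an allocation.

```shell
#!/bin/sh
# Classify one line of `grep PRETTY /etc/os-release` output.
# Assumption: the host OS is RHEL, so a "Red Hat" line means the task
# ran on the host rather than inside the container.
classify_os_line() {
    case "$1" in
        *"Red Hat"*) echo "host" ;;
        PRETTY_NAME=*) echo "container" ;;
        *) echo "other" ;;
    esac
}

# Run the check on one node at a time and print "<node>: <result>".
# Usage: check_each_node <image.sqsh> <node> [<node> ...]
# This needs srun and an existing allocation, so it is only defined here.
check_each_node() {
    image=$1
    shift
    for node in "$@"; do
        # stderr (slurmstepd messages) is discarded; only stdout is classified.
        out=$(srun -N 1 -w "$node" --container-image="$image" \
            grep PRETTY /etc/os-release 2>/dev/null)
        printf '%s: %s\n' "$node" "$(classify_os_line "$out")"
    done
}
```

Inside an salloc allocation this could be called as check_each_node /home/user/tmp/enroot/petsc.sqsh $(scontrol show hostnames "$SLURM_JOB_NODELIST"). A node that reports "host" or "other" is a candidate for a missing or misconfigured pyxis install.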

lukeyeager commented 3 years ago

Also, your task_prolog is probably not going to be compatible with pyxis. You could write an enroot hook to mount your task_prolog directory inside all containers, but there's a good chance those scripts will try to do things that don't work inside the container filesystem.

kees-closed commented 3 years ago

> Can you verify that all of these compute nodes have pyxis installed and configured?

I wrote my own spec file a while ago, I guess because no RPM was available. Apparently I tested it on only one node, ran ln -s /usr/share/pyxis/pyxis.conf /etc/slurm/plugstack.conf.d/pyxis.conf manually on that node, and never added that step to the spec file. Sorry for the invalid bug report.

It works like a charm now.
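
For anyone hitting the same thing, a small sketch of how the missing link could be spotted. The function name is hypothetical, and the default path assumes the /etc/slurm layout used in this thread.

```shell
#!/bin/sh
# Print "ok" if the pyxis plugstack entry exists under the given Slurm
# config directory (default: /etc/slurm), otherwise "missing".
# [ -e ] follows symlinks, so a dangling link also counts as missing.
pyxis_conf_status() {
    conf="${1:-/etc/slurm}/plugstack.conf.d/pyxis.conf"
    if [ -e "$conf" ]; then
        echo "ok"
    else
        echo "missing"
    fi
}

# The same test as a one-liner across nodes (needs srun, so left as a comment):
#   srun -N 4 -l sh -c '[ -e /etc/slurm/plugstack.conf.d/pyxis.conf ] && echo ok || echo missing'
```

With -l, srun prefixes each output line with the task number, which makes it easy to see which node is missing the link.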

NVIDIA / pyxis, issue #29