Closed (arnoldas500 closed 2 years ago)
Are you running the srun commands from within a salloc or another srun? Or is it directly from the login node?
I am running directly from the login node. If I start an interactive session on the compute node, all the enroot commands seem to work properly.
So the srun commands are running under 2 different Slurm jobs.
Either pyxis is configured with container_scope=job, which cleans up the named container when the first job finishes (see https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration), or perhaps a Slurm epilog cleans up named containers after a job completes.
If you use --container-name from a single Slurm job, it should work in all cases, for instance inside an sbatch script, a salloc session, or an interactive srun session.
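As a minimal sketch of the single-job case described above, the snippet below writes a batch script in which both srun steps belong to the same Slurm job, so the named container created by the first step should still exist for the second. The file name named_container.sbatch, the job name, and the single-node request are illustrative assumptions, not something taken from this thread.

```shell
# Write a hypothetical batch script; nothing is submitted here.
cat > named_container.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=named-container-demo
#SBATCH --nodes=1

# Step 1: import the image and create the named container.
srun --container-image=ubuntu:20.04 --container-name=myubuntu true

# Step 2: reuse the container by name, within the same job.
srun --container-name=myubuntu cat /etc/os-release
EOF
```

Submitting it with `sbatch named_container.sbatch` runs both steps under one job ID, which is the situation where neither a job-scoped cleanup nor an epilog runs between the two srun calls.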
So far still no luck. I updated /etc/slurm/plugstack.conf.d/pyxis.conf to contain container_scope=global and restarted Slurm, without success. I found this message in slurmd.log: error: pyxis: unknown configuration option: container_scope=global
If I take out container_scope=global and restart Slurm, everything works fine, but named containers won't get saved.
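When checking this kind of setup, it can help to confirm which scope the plugstack file actually sets. The helper below is a rough sketch: the function name pyxis_scope is hypothetical, and it assumes the option appears as a container_scope=... argument on the non-comment line that loads spank_pyxis.so.

```shell
# Print the container_scope value found in a pyxis plugstack file,
# or "unset" if the option is not present (hypothetical helper).
pyxis_scope() {
    conf="${1:-/etc/slurm/plugstack.conf.d/pyxis.conf}"
    # Skip comment lines, keep only the line loading the plugin,
    # then extract the value of the first container_scope argument.
    scope=$(grep -v '^[[:space:]]*#' "$conf" \
        | grep 'spank_pyxis\.so' \
        | grep -o 'container_scope=[a-z]*' \
        | head -n 1 \
        | cut -d= -f2) || true
    echo "${scope:-unset}"
}
```

Calling `pyxis_scope` with no argument reads the path mentioned in this thread; passing another path checks a different file.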
Not sure if this is the appropriate behavior, but if I run an sbatch script or a single srun from the login node, I get the error: a container with name "myubuntu" does not exist.
Now, if I instead start an interactive session using srun to get onto the compute node and run the same commands I tried to run before on the login node, it works without issues. Is this the expected behavior?
Not deleting the data & runtime paths in 50-lastuserjob-all-enroot-dirs from the epilog.d now saves the images if --container-name is specified. Thank you for the help.
> Now, if I instead start an interactive session using srun to get onto the compute node and run the same commands I tried to run before on the login node, it works without issues. Is this the expected behavior?
Yes, because it's a single Slurm job, so the epilog does not run to clean up the existing containers.
> Not deleting the data & runtime paths in 50-lastuserjob-all-enroot-dirs from the epilog.d now saves the images if --container-name is specified. Thank you for the help.
Sure, thanks for reporting back!
Example:

    $ srun --container-image=ubuntu:20.04 --container-name=myubuntu true
    pyxis: importing docker image ...
    pyxis: creating container filesystem ...
    pyxis: starting container ...
Now the container name myubuntu should be preloaded. When I try to use the container name "myubuntu", I get the error below:

    $ srun --container-name=myubuntu which file
    pyxis: error: a container with name "myubuntu" does not exist, and --container-image is not set
    slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
    slurmstepd: error: Failed to invoke spank plugin stack
    srun: error: hulk: task 0: Exited with exit code 1