NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

--container-name command-line argument won't save or load container on host #66

Closed · arnoldas500 closed this issue 2 years ago

arnoldas500 commented 2 years ago

Example:

```
$ srun --container-image=ubuntu:20.04 --container-name=myubuntu true
pyxis: importing docker image ...
pyxis: creating container filesystem ...
pyxis: starting container ...
```

Now the container named "myubuntu" should be saved. But when I try to reuse it by name, I get the error below:

```
$ srun --container-name=myubuntu which file
pyxis: error: a container with name "myubuntu" does not exist, and --container-image is not set
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: hulk: task 0: Exited with exit code 1
```

flx42 commented 2 years ago

Are you running the srun commands from within a salloc or another srun? Or is it directly from the login node?

arnoldas500 commented 2 years ago

I am running directly from the login node. If I start an interactive session on the compute node, all the enroot commands seem to work properly.

flx42 commented 2 years ago

So the two srun commands are running under two different Slurm jobs. Either pyxis is configured with container_scope=job, which cleans up named containers when the first job finishes (see https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration), or a Slurm epilog cleans up named containers after a job completes.
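For reference, a minimal sketch of the kind of plugstack configuration this refers to (the library path is an assumption; adjust to your install):

```
# /etc/slurm/plugstack.conf.d/pyxis.conf (path per the setup wiki)
# container_scope=job removes named containers when the job ends;
# container_scope=global keeps them across jobs.
required /usr/local/lib/slurm/spank_pyxis.so container_scope=job
```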

If you use --container-name from a single Slurm job, it should work in all cases: for instance, inside an sbatch script, a salloc session, or an interactive srun session.
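As an illustration, a minimal sbatch sketch (image and container name taken from the report above) where the named container is created and reused within the same job:

```
#!/bin/bash
#SBATCH --job-name=pyxis-named-container

# First job step imports the image and creates the named container.
srun --container-image=ubuntu:20.04 --container-name=myubuntu true

# Later steps in the same job reuse the named container;
# --container-image is no longer needed.
srun --container-name=myubuntu cat /etc/os-release
```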

arnoldas500 commented 2 years ago

So far, still no luck. I updated /etc/slurm/plugstack.conf.d/pyxis.conf to contain container_scope=global and restarted Slurm, without success. The logs show this message:

```
slurmd.log: error: pyxis: unknown configuration option: container_scope=global
```

If I take out container_scope=global and restart Slurm, everything works fine, but container names won't get saved.
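(An "unknown configuration option" error here typically means the installed pyxis predates the container_scope option. For anyone debugging the same thing, one way to check which plugstack file Slurm is actually reading, with illustrative output:

```
$ scontrol show config | grep -i plugstack
PlugStackConfig         = /etc/slurm/plugstack.conf
```
)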

arnoldas500 commented 2 years ago

Not sure if this is the appropriate behavior, but if I run an sbatch script from the login node, or a single srun from the login node, I get the "a container with name "myubuntu" does not exist" error.

If instead I start an interactive session using srun to get onto the compute node and run the same commands I tried before on the login node, it works without issues. Is this the expected behavior?

arnoldas500 commented 2 years ago

Not deleting the data and runtime paths in 50-lastuserjob-all-enroot-dirs in epilog.d now preserves the images when --container-name is specified. Thank you for the help.
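For context, a purely illustrative sketch of the kind of epilog fragment that causes this (the real 50-lastuserjob-all-enroot-dirs script and its paths will differ):

```
#!/bin/bash
# Illustrative only: removing enroot's runtime and data directories in an
# epilog also deletes named pyxis containers, so they cannot be reused by a
# later job. The paths below are assumptions; SLURM_JOB_UID is set by Slurm
# in the epilog environment.
rm -rf "/run/enroot/user-${SLURM_JOB_UID}"        # assumed ENROOT_RUNTIME_PATH
rm -rf "/raid/enroot/data/user-${SLURM_JOB_UID}"  # assumed ENROOT_DATA_PATH
```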

flx42 commented 2 years ago

> If instead I start an interactive session using srun to get onto the compute node and run the same commands I tried before on the login node, it works without issues. Is this the expected behavior?

Yes, because it's a single Slurm job, so the epilog does not run in between to clean up the existing containers.

> Not deleting the data and runtime paths in 50-lastuserjob-all-enroot-dirs in epilog.d now preserves the images when --container-name is specified. Thank you for the help.

Sure, thanks for reporting back!