NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

pyxis option "container-image" not found through slurm #84

Closed shubhammehta03 closed 11 months ago

shubhammehta03 commented 2 years ago

Hi, I have been facing an issue: when I submit a job that uses an enroot container through Slurm, the job gets allocated a GPU but the container does not initialize, and the following error is shown:

error: Warning: SPANK plugin "pyxis" option "container-image" not found

When we start slurmd manually with slurmd -Dvv on the GPU node and then submit the job, the same command works and the container gets created.

Cluster information:
slurm version: 20.11.8
OS version: CentOS 7
pyxis version: 0.7.0
enroot version: tried 3.3.1 and 3.4

I have attached screenshots of the same.

[Attached: three screenshots]

flx42 commented 2 years ago

I don't see the error you are mentioning, but I do see the following:

slurmstepd: error: Could not run slurm task_prolog [...]: No such file or directory

Please share the logs as text; they are easier to read that way.

flx42 commented 2 years ago

Also make sure pyxis is installed on the login node and all compute nodes.
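
One quick way to check this on each node is to look for the pyxis options in the srun help text. The following is only a minimal sketch: the helper name has_pyxis_option is made up for illustration, and it assumes that a working pyxis install adds --container-image to the output of srun --help.

```shell
#!/bin/sh
# Hypothetical helper (not part of pyxis or Slurm): succeeds if the given
# help text mentions the --container-image option.
has_pyxis_option() {
    printf '%s\n' "$1" | grep -q -- '--container-image'
}

# Only runs where srun is available; run it on the login node and on
# every compute node and compare the results.
if command -v srun >/dev/null 2>&1; then
    if has_pyxis_option "$(srun --help 2>&1)"; then
        echo "$(hostname): pyxis options are registered"
    else
        echo "$(hostname): pyxis options are missing"
    fi
fi
```

If a node reports the options as missing, the plugin is probably not being loaded there.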

shubhammehta03 commented 1 year ago

pyxis is installed on the login node and the GPU node. Please see the attached picture; this output appears after I submit a container job to the GPU node. If I log in to the GPU node directly, srun --help shows the container options.

[Attached: one screenshot]

shubhammehta03 commented 1 year ago

Only a single error appears in the logs: 'error: Warning: SPANK plugin "pyxis" option "container-image" not found'

flx42 commented 1 year ago

Not sure what's going on then. Make sure that Slurm is configured to use pyxis on all nodes too, e.g.:

$ cat /etc/slurm/plugstack.conf.d/pyxis.conf
required /usr/lib/x86_64-linux-gnu/slurm/spank_pyxis.so
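
To repeat that check on several nodes, something like the sketch below could help. The function name plugstack_has_pyxis is hypothetical, and the two paths in the loop are assumptions based on the example above; your site may keep its plugstack configuration elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if any of the given files has a
# non-commented "required" or "optional" line that loads spank_pyxis.so.
plugstack_has_pyxis() {
    grep -Eq '^[[:space:]]*(required|optional)[[:space:]]+[^#]*spank_pyxis\.so' "$@" 2>/dev/null
}

# Assumed locations; adjust to match your installation.
found=no
for f in /etc/slurm/plugstack.conf /etc/slurm/plugstack.conf.d/*.conf; do
    [ -f "$f" ] || continue
    if plugstack_has_pyxis "$f"; then
        found=yes
        echo "$(hostname): spank_pyxis.so is loaded by $f"
    fi
done
[ "$found" = yes ] || echo "$(hostname): no plugstack entry for spank_pyxis.so found"
```

Run it on the login node and on each compute node; every node should report an entry.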

shubhammehta03 commented 1 year ago

Yes, it is configured: pyxis works when slurmd -Dvv is invoked manually on the GPU node.

flx42 commented 1 year ago

It could be an issue with security settings blocking access to Slurm files. Is SELinux enabled? Can you try temporarily disabling it?

shubhammehta03 commented 1 year ago

SELinux is already disabled, as the cluster is in a trusted zone.

flx42 commented 1 year ago

Ok, since the problem seems to be with Slurm and SPANK in general (not specific to pyxis), you should maybe file an issue against the SchedMD bug tracker if you have a support agreement in place.

yonglianglan commented 1 year ago

Any progress on this problem? I've run into it as well. Thanks.

flx42 commented 1 year ago

@yonglianglan could you please file a new bug and share your logs? The situation might be different.