Closed shubhammehta03 closed 11 months ago
I don't see the error you are mentioning, but I do see the following:
slurmstepd: error: Could not run slurm task_prolog [...]: No such file or directory
Please share the logs as text, it is easier to look at.
Also make sure pyxis is installed on the login node and all compute nodes.
pyxis is installed on both the login node and the GPU node. Please see the attached picture. The error appears after I submit a container job to the GPU node. If I log in to the GPU node directly, srun --help shows the container options.
The logs show only a single error: 'error: Warning: SPANK plugin "pyxis" option "container-image" not found'
Not sure what's going on then. Make sure that Slurm is configured to use pyxis on all nodes too, e.g.:
$ cat /etc/slurm/plugstack.conf.d/pyxis.conf
required /usr/lib/x86_64-linux-gnu/slurm/spank_pyxis.so
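The same check can be scripted so it is easy to repeat on the login node and on every compute node. This is only a minimal sketch: `check_pyxis_conf` is a hypothetical helper name, and it assumes the config lives at the path shown above and that pyxis is loaded with a `required` or `optional` line.

```shell
# Hypothetical helper (not part of Slurm or pyxis): report whether a
# plugstack config file contains an active line that loads spank_pyxis.so.
# Prints "found" or "missing".
check_pyxis_conf() {
    conf="$1"
    # Only uncommented "required"/"optional" lines count as active.
    if [ -r "$conf" ] && grep -Eq '^[[:space:]]*(required|optional)[[:space:]]+[^#]*spank_pyxis\.so' "$conf"; then
        echo "found"
    else
        echo "missing"
    fi
}

# Path taken from the example above; adjust if your layout differs.
check_pyxis_conf /etc/slurm/plugstack.conf.d/pyxis.conf
```

Running it on each node and comparing the results shows quickly whether one node is missing the plugin entry. It does not verify that the .so file itself exists or that plugstack.conf includes this directory.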
Yes, it is configured: pyxis works when slurmd -Dvv is invoked manually on the GPU node.
It could be an issue with security settings blocking access to Slurm files. Is SELinux enabled? Can you try temporarily disabling it?
SELinux is already disabled, as the cluster is in a trusted zone.
Ok, since the problem seems to be with Slurm and SPANK in general (not specific to pyxis), you should maybe file an issue against the SchedMD bug tracker if you have a support agreement in place.
Any progress on this problem? I've run into it as well. Thanks.
@yonglianglan could you please file a new bug and share your logs? The situation might be different.
Hi, I have been facing an issue where, when I submit an enroot container job through Slurm, the job gets allocated a GPU but the container does not initialize, showing the error:
error: Warning: SPANK plugin "pyxis" option "container-image" not found
When we invoke slurmd -Dvv manually on the GPU node and then submit the job, the same command works and the container gets created.
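To share the output of the manual run as text rather than as a picture, something like the following may help. It is a rough sketch: `spank_lines` is a hypothetical helper and `slurmd-manual.log` is a placeholder file name, neither of which comes from Slurm or pyxis.

```shell
# Hypothetical helper: keep only the SPANK/pyxis-related lines of a saved log.
spank_lines() {
    grep -Ei 'spank|pyxis' "$1"
}

# Assumed usage on the GPU node (log file name is a placeholder):
#   slurmd -Dvv 2>&1 | tee slurmd-manual.log
#   spank_lines slurmd-manual.log
```

Comparing these lines with the ones logged when slurmd runs as a service should show whether the plugin is loaded in both cases.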
Cluster information:
Slurm version: 20.11.8
OS version: CentOS 7
pyxis version: 0.7.0
enroot version: tried 3.3.1 and 3.4
I have attached a screenshot showing the error.