Closed AlienYouth closed 1 year ago
No, that's not normal behavior. Which version of Slurm is this?
Make sure pyxis is installed on all nodes in your cluster.
Perhaps your container image is doing something unusual too. Try another example, like this one:
$ redis-server --version
Command 'redis-server' not found, but can be installed with:
sudo apt install redis-server
$ srun --container-image=redis redis-server --version
pyxis: importing docker image: redis
pyxis: imported docker image: redis
Redis server v=7.0.9 sha=00000000:0 malloc=jemalloc-5.2.1 bits=64 build=a7db51f3965d7ff7
I'm glad to hear this isn't normal! Hoping I can get it squared away then.
Slurm version is 21.08.4
Here is the output from the above command (although with the image pulled from the local repo, as the computes do not have direct internet access)-
[root@login1 ~]# redis-server --version
-bash: redis-server: command not found
root@login1# srun --container-image=login1:5000#redis redis-server --version
srun: fatal: Can not execute redis-server
However this same container can start and run commands that are found both on the login node and the image successfully-
[root@login1 ~]# srun --container-image=login1:5000#redis /bin/echo hello
pyxis: importing docker image: login1:5000#redis
pyxis: imported docker image: login1:5000#redis
hello
pyxis is installed on all nodes, but with this failure it doesn't even seem to get past the login node.
I should add that this works-
[root@login1 ~]# touch /usr/local/bin/redis-server
[root@login1 ~]# chmod 755 /usr/local/bin/redis-server
[root@login1 ~]# srun --container-image=login1:5000#redis redis-server --version
pyxis: importing docker image: login1:5000#redis
pyxis: imported docker image: login1:5000#redis
Redis server v=7.0.9 sha=00000000:0 malloc=jemalloc-5.2.1 bits=64 build=a7db51f3965d7ff7
Another thing I should note is that Slurm is not installed in a default location; it is installed on shared storage. Do I need to change something when I run make install? I assumed that since it works at some level, the install was fine, but maybe that was a silly assumption. Thanks for responding!
Did you set up anything related to sbcast? For instance, this Slurm option: https://slurm.schedmd.com/slurm.conf.html#OPT_BcastParameters
It looks like the kind of error message you would get when srun --bcast is used.
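For context, sbcast-related behavior is usually configured in slurm.conf with a line along these lines (an illustrative fragment only; the option name comes from the documentation linked above, and the value here is made up, not taken from this cluster):

```
# slurm.conf (hypothetical fragment): stage broadcast executables to a
# node-local directory instead of the default location
BcastParameters=DestDir=/tmp
```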
Nope, it looks like our slurm.conf is pretty basic. And from what I can tell, those options would only apply if we had used the --bcast option?
Possibly, but I suspect that the bcast code path is taken for some reason and thus pyxis is not even invoked here. After using --container-image, try running scontrol show jobs to check if a Slurm job was created; if not, then the job was immediately rejected by the srun command line.
Also try to reproduce the issue without pyxis by trying to use a binary only installed on compute nodes and not installed on the login node.
It does fail immediately without creating a Slurm job. And I do get the same result without pyxis-
[root@login1 ~]# srun hostname
gpn00
[root@login1 ~]# which hostname
/usr/bin/hostname
[root@login1 ~]# mv /usr/bin/hostname /usr/bin/hostname.test
[root@login1 ~]# srun hostname
srun: fatal: Can not execute hostname
I assumed this was normal behavior because all of our clusters (or at least the ones I checked) give me the same result, although they were probably all configured the same way. I will dig deeper into our slurm setup and see what I come across.
AHA! Just found this, which verifies the executable before submitting a job-
[root@login1 ~]# env | grep -i slurm
…
SLURM_TEST_EXEC=1
Thanks for your guidance, turns out I just had to learn a bit about slurm!
I would expect this is an easy answer, but I can't seem to find it. When using pyxis, it appears that the same command must exist both on the host and in the container. For example, this fails-
root@login1# srun --ntasks=1 --container-image=login1:5000#bert /opt/conda/bin/python -c 'print('\''hello'\'')'
srun: fatal: Can not execute /opt/conda/bin/python
But if I create an empty file on the host-
root@login1# touch /opt/conda/bin/python
root@login1# chmod 755 /opt/conda/bin/python
It works!
root@login1# srun --ntasks=1 --container-image=login1:5000#bert /opt/conda/bin/python -c 'print('\''hello'\'')'
pyxis: importing docker image: login1:5000#bert
pyxis: imported docker image: login1:5000#bert
hello
The issue also presents itself this way-
root@login1# srun --ntasks=1 --container-image=login1:5000#bert ls /
pyxis: importing docker image: login1:5000#bert
pyxis: imported docker image: login1:5000#bert
slurmstepd: error: execve(): /usr/bin/ls: No such file or directory
srun: error: gpn00: task 0: Exited with exit code 2
root@login1# srun --ntasks=1 --container-image=login1:5000#bert which ls
pyxis: importing docker image: login1:5000#bert
pyxis: imported docker image: login1:5000#bert
/bin/ls
root@login1# which ls
/usr/bin/ls
If I use /bin/ls in the srun command, it works, because both /bin/ls and /usr/bin/ls exist on the login node.
Is this normal behavior? Will I need to have the same files exist on the login node as the commands I am running within the container? Thanks!