NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

srun fails when command is not on host #104

Closed AlienYouth closed 1 year ago

AlienYouth commented 1 year ago

I would expect this has an easy answer, but I can't seem to find it. When using pyxis, it appears that the same command must exist both on the host and in the container. For example, this fails:

root@login1# srun --ntasks=1 --container-image=login1:5000#bert /opt/conda/bin/python -c 'print('\''hello'\'')'
srun: fatal: Can not execute /opt/conda/bin/python

But if I create an empty file on the host:

root@login1# touch /opt/conda/bin/python
root@login1# chmod 755 /opt/conda/bin/python

It works!

root@login1# srun --ntasks=1 --container-image=login1:5000#bert /opt/conda/bin/python -c 'print('\''hello'\'')'
pyxis: importing docker image: login1:5000#bert
pyxis: imported docker image: login1:5000#bert
hello

This issue also presents this way:

root@login1# srun --ntasks=1 --container-image=login1:5000#bert ls /
pyxis: importing docker image: login1:5000#bert
pyxis: imported docker image: login1:5000#bert
slurmstepd: error: execve(): /usr/bin/ls: No such file or directory
srun: error: gpn00: task 0: Exited with exit code 2

root@login1# srun --ntasks=1 --container-image=login1:5000#bert which ls
pyxis: importing docker image: login1:5000#bert
pyxis: imported docker image: login1:5000#bert
/bin/ls

root@login1# which ls
/usr/bin/ls

If I use /bin/ls in the srun command, it works, because both /bin/ls and /usr/bin/ls exist on the login node.

Is this normal behavior? Will I need to have the same files exist on the login node as the commands I am running within the container? Thanks!

flx42 commented 1 year ago

No, that's not normal behavior. Which version of Slurm is this?

Make sure pyxis is installed on all nodes in your cluster.

Perhaps your container image is doing something unusual too. Try another example, like this one:

$ redis-server --version
Command 'redis-server' not found, but can be installed with:
sudo apt install redis-server

$ srun --container-image=redis redis-server --version
pyxis: importing docker image: redis
pyxis: imported docker image: redis
Redis server v=7.0.9 sha=00000000:0 malloc=jemalloc-5.2.1 bits=64 build=a7db51f3965d7ff7
AlienYouth commented 1 year ago

I'm glad to hear this isn't normal! Hoping I can get it squared away then. Slurm version is 21.08.4.

Here is the output from the above command (although with the image pulled from the local repo, as the compute nodes do not have direct internet access):

[root@login1 ~]# redis-server --version
-bash: redis-server: command not found
root@login1# srun --container-image=login1:5000#redis redis-server --version
srun: fatal: Can not execute redis-server

However, this same container can start and run commands that are found both on the login node and in the image:

[root@login1 ~]# srun --container-image=login1:5000#redis /bin/echo hello
pyxis: importing docker image: login1:5000#redis
pyxis: imported docker image: login1:5000#redis
hello

pyxis is installed on all nodes, but with this failure it doesn't even seem to get past the login node.

AlienYouth commented 1 year ago

I should add that this works:

[root@login1 ~]# touch /usr/local/bin/redis-server
[root@login1 ~]# chmod 755 /usr/local/bin/redis-server
[root@login1 ~]# srun --container-image=login1:5000#redis redis-server --version
pyxis: importing docker image: login1:5000#redis
pyxis: imported docker image: login1:5000#redis
Redis server v=7.0.9 sha=00000000:0 malloc=jemalloc-5.2.1 bits=64 build=a7db51f3965d7ff7

AlienYouth commented 1 year ago

Another thing I should note is that Slurm is not installed in a default location; it is installed on shared storage. Do I need to change something when I run make install? I assumed that since it works at some level the install was fine, but maybe that was a silly assumption. Thanks for responding!

flx42 commented 1 year ago

Did you set up anything related to sbcast? For instance, this Slurm option: https://slurm.schedmd.com/slurm.conf.html#OPT_BcastParameters

It looks like the kind of error message you would get when srun --bcast is used.
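(For context: such a setting would live in slurm.conf. The fragment below is purely illustrative; the cluster in this thread turned out not to have it, and the flag values shown are examples of documented BcastParameters options, not a recommendation.)

```
# slurm.conf -- illustrative fragment only.
# BcastParameters controls sbcast / srun --bcast behavior, e.g.:
#   DestDir=<path>      where broadcast executables are staged on nodes
#   Compression=<type>  compression used during the broadcast
BcastParameters=DestDir=/tmp,Compression=lz4
```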

AlienYouth commented 1 year ago

Nope, it looks like our slurm.conf is pretty basic. And from what I can tell, those options would only apply if we had used the --bcast option?

flx42 commented 1 year ago

Possibly, but I suspect that the bcast code path is taken for some reason, and thus pyxis is not even invoked here. After using --container-image, try running scontrol show jobs to check whether a Slurm job was created; if not, then the job was immediately rejected by the srun command line.

Also try to reproduce the issue without pyxis by trying to use a binary only installed on compute nodes and not installed on the login node.

AlienYouth commented 1 year ago

It does fail immediately without creating a Slurm job. And I do get the same result without pyxis:

[root@login1 ~]# srun hostname
gpn00
[root@login1 ~]# which hostname
/usr/bin/hostname
[root@login1 ~]# mv /usr/bin/hostname /usr/bin/hostname.test
[root@login1 ~]# srun hostname
srun: fatal: Can not execute hostname

I assumed this was normal behavior because all of our clusters (or at least the ones I checked) give me the same result, although they were probably all configured the same way. I will dig deeper into our Slurm setup and see what I come across.

AlienYouth commented 1 year ago

AHA! Just found this, which makes srun verify the executable before submitting a job:

[root@login1 ~]# env | grep -i slurm
SLURM_TEST_EXEC=1

Thanks for your guidance; turns out I just had to learn a bit about Slurm!
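(To summarize the resolution: when SLURM_TEST_EXEC is set, srun checks on the submission host that the command exists and is executable before launching the step, which is why empty stub files on the login node "fixed" the container runs. A rough Python sketch of that pre-flight logic, not Slurm's actual code, might look like this; the command name is a stand-in for one that exists only inside the container.)

```python
import os
import shutil


def preflight_exec_check(cmd: str) -> None:
    """Rough sketch of srun's SLURM_TEST_EXEC behavior: when the
    variable is set, the command must resolve to an executable file
    on the submission host, or the job is rejected immediately."""
    if not os.environ.get("SLURM_TEST_EXEC"):
        return  # unset: no local check, the compute node resolves cmd
    path = cmd if os.path.isabs(cmd) else shutil.which(cmd)
    if path is None or not os.access(path, os.X_OK):
        raise SystemExit(f"srun: fatal: Can not execute {cmd}")


# With the variable set, a container-only command is rejected locally:
os.environ["SLURM_TEST_EXEC"] = "1"
try:
    preflight_exec_check("container-only-command")
except SystemExit as err:
    print(err)  # srun: fatal: Can not execute container-only-command
```

Unsetting SLURM_TEST_EXEC in the shell (or in whatever profile script exports it) restores the default behavior, where the executable only needs to exist on the compute node, inside the container image.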