NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

pyxis setup issue #127

Closed vinayburugu closed 8 months ago

vinayburugu commented 8 months ago

Hi, I installed pyxis and enroot, and I am having trouble running srun with a *.sqsh image. Can anyone identify the cause?

/opt/slurm/bin/srun --container-image hello-world.sqsh hostname

pyxis: importing docker image: hello-world.sqsh
slurmstepd: error: pyxis: child 10028 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Authenticating with user:
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: compute-0: task 0: Exited with exit code 1

cat /opt/slurm/etc/plugstack.conf
include /opt/slurm/etc/plugstack.conf.d/*

cat /opt/slurm/etc/plugstack.conf.d/pyxis.conf
required /usr/local/lib/slurm/spank_pyxis.so

flx42 commented 8 months ago

Can you try with --container-image ./hello-world.sqsh?

You have to start the path with / or ./ to use a squashfs image, otherwise it's considered as a docker image from a registry.
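To make the rule concrete, here is a minimal shell sketch of how the argument is interpreted as described above. It only illustrates the rule as stated in this comment; it is not the actual pyxis source, and classify_image is a hypothetical helper name.

```shell
# Sketch of the path rule described above; NOT pyxis's real implementation.
# classify_image is a hypothetical helper used only for illustration.
classify_image() {
    case "$1" in
        /*|./*) echo "squashfs" ;;   # absolute path or ./relative path: local squashfs file
        *)      echo "registry" ;;   # anything else: image name looked up in a registry
    esac
}
```

Under this rule, hello-world.sqsh is classified as a registry image, while ./hello-world.sqsh and /home/ubuntu/hello-world.sqsh are classified as squashfs files.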

vinayburugu commented 8 months ago

I do have the file at /home/ubuntu/hello-world.sqsh, but I am still seeing a "No such file or directory" error.

/opt/slurm/bin/srun --container-image /home/ubuntu/hello-world.sqsh hostname
slurmstepd: error: pyxis: child 10261 failed with error code: 1
slurmstepd: error: pyxis: failed to create container filesystem
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [ERROR] No such file or directory: /home/ubuntu/hello-world.sqsh
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: compute-0: task 0: Exited with exit code 1

flx42 commented 8 months ago

That should work. Is /home/ubuntu mounted on all nodes?

flx42 commented 8 months ago

If you are running the srun from a Slurm login node, then perhaps the hello-world.sqsh file is only present on the login node and not on the other nodes?
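One way to check this is to test for the file on the compute node itself rather than on the login node. The following is only a sketch: check_image is a hypothetical helper, and it has to be executed on each node you want to verify.

```shell
# check_image is a hypothetical helper: it reports whether the given file
# is readable on the host where it runs, and returns non-zero if it is not.
check_image() {
    if [ -r "$1" ]; then
        echo "$(uname -n): found $1"
    else
        echo "$(uname -n): missing $1"
        return 1
    fi
}
```

The same check can be done by hand by running a plain srun without --container-image, for example srun ls -l /home/ubuntu/hello-world.sqsh. If that reports no such file, the image is not visible from the compute node.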

vinayburugu commented 8 months ago

I moved the file to a shared file system, but I am still seeing the task_init() error. I also tried restarting slurmctld and slurmd.

/opt/slurm/bin/srun --container-image /mnt/shared/hello-world.sqsh hostname

slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: enroot-switchroot: failed to execute: /bin/sh: No such file or directory
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: compute-0: task 0: Exited with exit code 1

flx42 commented 8 months ago

Is it the DockerHub hello-world image? https://hub.docker.com/_/hello-world

If so, it is a FROM scratch image, so it contains only a single binary and no /bin/sh: https://github.com/docker-library/hello-world/blob/3fb6ebca4163bf5b9cc496ac3e8f11cb1e754aee/amd64/hello-world/Dockerfile

Try an ubuntu image instead.
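If you want to check an image before submitting a job, a rough sketch is to look for /bin/sh in its unpacked root filesystem, since that is the path the enroot-switchroot error above complains about. This assumes you already have the image contents extracted into a directory; has_shell is a hypothetical helper, not part of pyxis or enroot.

```shell
# has_shell is a hypothetical helper; $1 is a directory that holds an
# unpacked image root filesystem (how it was unpacked is up to you).
has_shell() {
    # -e follows symlinks; -L also accepts a symlink whose target
    # only resolves once the directory is used as the container root.
    [ -e "$1/bin/sh" ] || [ -L "$1/bin/sh" ]
}
```

For a root filesystem that holds only one binary, has_shell returns non-zero, which matches the failure shown in the log. A regular distribution image such as ubuntu should pass.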

vinayburugu commented 8 months ago

It worked, @flx42. Thank you.