NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Error upon invoking container image (failed with rc=-1) #135

Open as7a5 opened 4 months ago

as7a5 commented 4 months ago

Hi, I get the following error when running a container image (enroot 3.4.1), for instance:

[siavoa01@bigpurple-ln3 superpod]$ srun --container-image ./ubuntu.sqsh -t 00:60:00 --cpus-per-task=20 --tasks-per-node=1 --gpus-per-task=8 --mem=100G --pty bash
srun: job 34262460 queued and waiting for resources
srun: job 34262460 has been allocated resources
slurmstepd-sp-0016: error: pyxis: couldn't start container
slurmstepd-sp-0016: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd-sp-0016: error: Failed to invoke spank plugin stack
srun: error: sp-0016: task 0: Exited with exit code 1

Could you please look into this issue? Thank you.

flx42 commented 4 months ago

Could you look at the slurmd log to check if you have more details about the failure? Could you also try with a simpler command like srun --container-image ubuntu:22.04 hostname?

as7a5 commented 3 months ago

Should that command be run from the login node or from a compute node? From the login node I get:

[siavoa01@bigpurple-ln2 ~]$ srun -p superpod -t 00:60:00 --mem=50G --container-image ubuntu:22.04 hostname
pyxis: importing docker image: ubuntu:22.04
pyxis: imported docker image: ubuntu:22.04
slurmstepd-sp-0004: error: run_command: slurm task_prolog can not be executed (/cm/local/apps/cmd/scripts/taskprolog) No such file or directory
slurmstepd-sp-0004: error: slurm task_prolog did not exit normally. reason: Run command failed - configuration error
slurmstepd-sp-0004: error: TaskProlog failed status=1
srun: error: sp-0004: task 0: Exited with exit code 1

The file /cm/local/apps/cmd/scripts/taskprolog exists and is accessible.
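One thing that may be worth ruling out: "No such file or directory" on exec can refer to the script's interpreter (its `#!` line) rather than the script itself, and the answer can depend on which filesystem view the check runs in (host node versus inside the container). Whether either applies on this cluster is an assumption, not something the output above confirms. A minimal Python sketch of such a check, where `diagnose_exec_enoent` is a hypothetical helper name:

```python
import os


def diagnose_exec_enoent(path):
    """List reasons why executing `path` could fail, as seen from this process."""
    if not os.path.exists(path):
        return [f"{path}: does not exist in this filesystem view"]

    problems = []
    if not os.access(path, os.X_OK):
        problems.append(f"{path}: not executable for the current user")

    # A script whose shebang interpreter is missing also fails with ENOENT,
    # even though the script file itself exists.
    with open(path, "rb") as f:
        first_line = f.readline(256)
    if first_line.startswith(b"#!"):
        parts = first_line[2:].decode(errors="replace").split()
        if parts and not os.path.exists(parts[0]):
            problems.append(f"{path}: interpreter {parts[0]} not found")
    return problems


if __name__ == "__main__":
    # Path taken from the error message in this thread.
    for line in diagnose_exec_enoent("/cm/local/apps/cmd/scripts/taskprolog"):
        print(line)
```

Running it on the compute node, and if possible from inside the container as well, would show whether the path or its interpreter is missing in that particular view.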

as7a5 commented 3 months ago

The slurmd log on the node does not provide much additional information:

[2024-03-11T14:12:35.028] [34335866.extern] task/cgroup: _memcg_initialize: job: alloc=51200MB mem.limit=51200MB memsw.limit=51200MB job_swappiness=18446744073709551614
[2024-03-11T14:12:35.028] [34335866.extern] task/cgroup: _memcg_initialize: step: alloc=51200MB mem.limit=51200MB memsw.limit=51200MB job_swappiness=18446744073709551614
[2024-03-11T14:12:36.074] launch task StepId=34335866.0 request from UID:1023625167 GID:1023822516 HOST:172.16.0.102 PORT:55358
[2024-03-11T14:12:36.074] task/affinity: lllp_distribution: JobId=34335866 implicit auto binding: threads, dist 1
[2024-03-11T14:12:36.074] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2024-03-11T14:12:36.074] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [34335866]: mask_cpu, 0x0000000000000000000003FF00000000000000000000000003FF0000
[2024-03-11T14:12:36.620] [34335866.0] task/cgroup: _memcg_initialize: job: alloc=51200MB mem.limit=51200MB memsw.limit=51200MB job_swappiness=18446744073709551614
[2024-03-11T14:12:36.620] [34335866.0] task/cgroup: _memcg_initialize: step: alloc=51200MB mem.limit=51200MB memsw.limit=51200MB job_swappiness=18446744073709551614
[2024-03-11T14:12:36.732] [34335866.0] pyxis: importing docker image: ubuntu:22.04
[2024-03-11T14:12:37.756] [34335866.0] pyxis: imported docker image: ubuntu:22.04
[2024-03-11T14:12:37.756] [34335866.0] pyxis: creating container filesystem: pyxis_34335866_34335866.0
[2024-03-11T14:12:37.929] [34335866.0] pyxis: starting container: pyxis_34335866_34335866.0
[2024-03-11T14:12:38.117] [34335866.0] error: run_command: slurm task_prolog can not be executed (/cm/local/apps/cmd/scripts/taskprolog) No such file or directory
[2024-03-11T14:12:38.117] [34335866.0] error: slurm task_prolog did not exit normally. reason: Run command failed - configuration error
[2024-03-11T14:12:38.117] [34335866.0] error: TaskProlog failed status=1
[2024-03-11T14:12:38.164] [34335866.0] pyxis: removing container filesystem: pyxis_34335866_34335866.0
[2024-03-11T14:12:38.169] [34335866.extern] done with step
[2024-03-11T14:12:38.245] [34335866.0] stepd_cleanup: done with step (rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
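Since most of this log is cgroup and affinity noise, a small filter can narrow it down to the pyxis and error entries. The Python sketch below is only an illustration: `relevant_entries` and its keyword list are hypothetical, and it assumes every slurmd entry starts with a bracketed timestamp as in the excerpt above.

```python
import re

# Zero-width match just before each timestamp such as [2024-03-11T14:12:36.732]
_ENTRY_START = re.compile(r"(?=\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}\])")


def relevant_entries(log_text, keywords=("pyxis:", "error:")):
    """Split slurmd log text into entries and keep those containing a keyword."""
    # Collapse line breaks first so that wrapped entries are rejoined.
    flat = " ".join(log_text.split())
    entries = [e.strip() for e in _ENTRY_START.split(flat) if e.strip()]
    return [e for e in entries if any(k in e for k in keywords)]


if __name__ == "__main__":
    import sys

    # Usage: python filter_slurmd.py < slurmd.log
    for entry in relevant_entries(sys.stdin.read()):
        print(entry)
```

On the excerpt above this would leave the pyxis import, create, start and remove entries plus the three TaskProlog errors, which makes it easier to see that the failure comes after the container has been started.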

flx42 commented 3 months ago

Is this NVIDIA Base Command Manager? /cm/local/apps/cmd/scripts/taskprolog sounds like it might be. You should reach out to your support contact for this product to solve this error first, then if you still have pyxis issues afterwards I can take a look.

as7a5 commented 3 months ago

Thanks, I will keep you updated.