NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

slurmstepd: error: pyxis: [ERROR] URL https://registry-1.docker.io/v2/library/??? #78

Closed jieguolove closed 2 years ago

jieguolove commented 2 years ago

According to https://catalog.ngc.nvidia.com/orgs/nvidia/containers/hpc-benchmarks, I have a local docker image (see screenshot).

I try to run:

CONT='hpl-21.4-ok'
srun -N 1 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" \
    hpl.sh --config dgx-a100 --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat

But it reports an error. Why does it request the URL https://registry-1.docker.io/v2/library??? I need to use the local image; how can I do that? Thanks!

[root@hpl enroot]# ll
total 0
drwxr-xr-x 23 root root 274 Jun 24 21:30 hpl-21.4-ok
[root@hpl enroot]# pwd
/root/.local/share/enroot
[root@hpl enroot]# CONT='hpl-21.4-ok'
[root@hpl enroot]# srun -N 1 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" \
    hpl.sh --config dgx-a100 --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat
pyxis: importing docker image: hpl-21.4-ok
slurmstepd: error: pyxis: child 30690 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Authenticating with user:
slurmstepd: error: pyxis: [INFO] Authentication succeeded
slurmstepd: error: pyxis: [INFO] Fetching image manifest list
slurmstepd: error: pyxis: [INFO] Fetching image manifest
slurmstepd: error: pyxis: [ERROR] URL https://registry-1.docker.io/v2/library/hpl-21.4-ok/manifests/latest returned error code: 401 Unauthorized
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: hpl: task 0: Exited with exit code 1
[root@hpl enroot]#

flx42 commented 2 years ago

You should do srun --container-image nvcr.io/nvidia/hpc-benchmarks:21.4-hpl. Pyxis cannot directly reference a local image from your docker daemon.

You could also use enroot import dockerd://hpl-21.4-ok and then pass the resulting .sqsh image to pyxis with --container-image.
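The two options can be sketched as a dry run (image and tag names are the ones from this thread; adjust for your cluster):

```shell
# Option 1: let pyxis pull from NGC directly (a registry reference,
# not a local docker image).
CONT='nvcr.io/nvidia/hpc-benchmarks:21.4-hpl'

# Option 2: convert the local docker image to a squashfs file first,
#   enroot import dockerd://hpl-21.4-ok
# then point pyxis at the resulting file path.
CONT='./hpl-21.4-ok.sqsh'

# Dry run: print the srun invocation instead of submitting it.
echo srun -N 1 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" \
    hpl.sh --config dgx-a100
```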

jieguolove commented 2 years ago

Thanks. I have installed Slurm, pyxis, and enroot, but it still doesn't work.

[root@hpl dat-files]# cd /etc/slurm
[root@hpl slurm]# ll
total 1365292
-rw-r--r--. 1 root  root         114 Jun 22 15:20 cgroup.conf
-rw-r--r--  1 root  root         216 Jun 22 15:16 cgroup.conf.example
-rw-r--r--  1 root  root        2825 Jun 22 15:16 cli_filter.lua.example
-rw-r--r--  1 root  root  1398013952 Jun 24 21:25 hpl-21.4-ok.sqsh
-rw-r--r--  1 root  root        4617 Jun 22 15:16 job_submit.lua.example
-rw-r--r--  1 root  root          38 Jun 24 15:40 plugstack.conf
drwxr-xr-x  2 root  root          24 Jun 24 15:31 plugstack.conf.d
-rw-r--r--  1 root  root        2879 Jun 22 15:16 prolog.example
-rw-r--r--. 1 root  root        1133 Jun 22 15:35 slurm.conf
-rw-r--r--  1 root  root        3062 Jun 22 15:16 slurm.conf.example
-rw-------. 1 slurm slurm        378 Jun 22 15:19 slurmdbd.conf
-rw-------  1 root  root         745 Jun 22 15:16 slurmdbd.conf.example
[root@hpl slurm]# more plugstack.conf
include /etc/slurm/plugstack.conf.d/*
[root@hpl slurm]# ll plugstack.conf.d
total 0
lrwxrwxrwx 1 root root 33 Jun 24 15:31 pyxis.conf -> /usr/local/share/pyxis/pyxis.conf
[root@hpl dat-files]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-06-24 15:40:43 CST; 17h ago
 Main PID: 4963 (slurmd)
    Tasks: 1
   Memory: 1.4M
   CGroup: /system.slice/slurmd.service
           └─4963 /usr/sbin/slurmd -D -s

Jun 24 21:14:13 hpl slurmd[4963]: slurmd: launch task StepId=67.0 request from UID:0 GID:0 HOST:192.168.207.50 PORT:42482
Jun 24 21:14:41 hpl slurmd[4963]: slurmd: launch task StepId=68.0 request from UID:0 GID:0 HOST:192.168.207.50 PORT:42516
Jun 24 21:23:06 hpl slurmd[4963]: slurmd: launch task StepId=69.0 request from UID:0 GID:0 HOST:192.168.207.50 PORT:42554
Jun 24 21:42:48 hpl slurmd[4963]: slurmd: launch task StepId=70.0 request from UID:0 GID:0 HOST:192.168.207.50 PORT:42626
Jun 24 21:46:47 hpl slurmd[4963]: slurmd: launch task StepId=71.0 request from UID:0 GID:0 HOST:192.168.207.50 PORT:42664
Jun 24 21:48:12 hpl slurmd[4963]: slurmd: launch task StepId=72.0 request from UID:0 GID:0 HOST:192.168.207.50 PORT:42698
Jun 24 21:58:25 hpl slurmd[4963]: slurmd: launch task StepId=73.0 request from UID:0 GID:0 HOST:192.168.207.50 PORT:42738
Jun 25 08:55:41 hpl slurmd[4963]: slurmd: launch task StepId=74.0 request from UID:0 GID:0 HOST:192.168.207.50 PORT:43326
Jun 25 08:57:22 hpl slurmd[4963]: slurmd: launch task StepId=75.0 request from UID:0 GID:0 HOST:192.168.207.50 PORT:43358
Jun 25 08:58:43 hpl slurmd[4963]: slurmd: launch task StepId=76.0 request from UID:0 GID:0 HOST:192.168.207.50 PORT:43392

[root@hpl dat-files]# srun --help|grep container-image
      --container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH

[root@hpl dat-files]# docker images
REPOSITORY                      TAG                       IMAGE ID       CREATED         SIZE
hpl-21.4-ok                     latest                    79af28d3f237   20 hours ago    1.5GB
hpl-20220624                    latest                    fa692f2fab01   20 hours ago    1.5GB
hpl-21.4-03                     latest                    60ee2b75e684   36 hours ago    1.5GB
hpl-21.4-02                     latest                    16329db456ca   41 hours ago    1.5GB
hpl-20.10-01                    latest                    6332e5b378a9   44 hours ago    1.41GB
hpl-01                          latest                    7ec871dd1d2a   47 hours ago    1.5GB
nvidia/cuda                     11.0.3-base-ubuntu20.04   d134f267bb7a   5 weeks ago     122MB
hello-world                     latest                    feb5d9fea6a5   9 months ago    13.3kB
nvcr.io/nvidia/hpc-benchmarks   21.4-hpl                  a40277aefbad   14 months ago   1.47GB
nvcr.io/nvidia/hpc-benchmarks   20.10-hpl                 97462d23c5ca   20 months ago   1.41GB

[root@hpl slurm]# cd /root/hpl/dat-files/
[root@hpl dat-files]# ll
total 1367600
-rw-r--r-- 1 root root 1398013952 Jun 25 08:58 hpl-21.4-ok.sqsh
-rw-r--r-- 1 root root       1132 Jun 24 13:18 HPL-dgx-a100-1N.dat
-rwxr-xr-x 1 root root       8555 Jun 24 13:18 hpl.sh
drwxr-xr-x 3 root root         48 Apr 15  2021 intel
drwxr-xr-x 7 root root         67 Apr 14  2021 openmpi
drwxr-xr-x 4 root root         32 Apr  8  2021 pmi
-rwxr-xr-x 1 root root    2391600 Jun 24 13:18 xhpl
[root@hpl dat-files]# CONT='hpl-21.4-ok'
[root@hpl dat-files]# MOUNT="/root/hpl/dat-files:/my-dat-files"
[root@hpl dat-files]# srun -N 1 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" \
    --container-mounts="${MOUNT}" \
    hpl.sh --config dgx-a100 --cpu-affinity 0 --cpu-cores-per-rank 32 --gpu-affinity 0 --mem-affinity 0 --dat /my-dat-files/HPL-dgx-a100-1N.dat
pyxis: importing docker image: hpl-21.4-ok
slurmstepd: error: pyxis: child 11558 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Authenticating with user:
slurmstepd: error: pyxis: [INFO] Authentication succeeded
slurmstepd: error: pyxis: [INFO] Fetching image manifest list
slurmstepd: error: pyxis: [INFO] Fetching image manifest
slurmstepd: error: pyxis: [ERROR] URL https://registry-1.docker.io/v2/library/hpl-21.4-ok/manifests/latest returned error code: 401 Unauthorized
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: hpl: task 0: Exited with exit code 1

[root@hpl dat-files]# enroot import dockerd://hpl-21.4-ok
[INFO] Fetching image

74dbf6c8579c24f51dd7570ffc000e52a732e858080d273e3408666f15fdedf0

[INFO] Extracting image content...
[INFO] Creating squashfs filesystem...

Parallel mksquashfs: Using 32 processors
Creating 4.0 filesystem on /root/hpl/dat-files/hpl-21.4-ok.sqsh, block size 131072.
[===========================================================-] 19428/19428 100%

Exportable Squashfs 4.0 filesystem, lzo compressed, data block size 131072
	uncompressed data, compressed metadata, compressed fragments, compressed xattrs
	duplicates are removed
Filesystem size 1365247.62 Kbytes (1333.25 Mbytes)
	96.15% of uncompressed filesystem size (1419912.37 Kbytes)
Inode table size 148712 bytes (145.23 Kbytes)
	33.42% of uncompressed inode table size (445038 bytes)
Directory table size 130324 bytes (127.27 Kbytes)
	48.12% of uncompressed directory table size (270829 bytes)
Number of duplicate files found 1842
Number of inodes 11742
Number of files 9147
Number of fragments 690
Number of symbolic links 1264
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 1331
Number of ids (unique uids + gids) 1
Number of uids 1
	root (0)
Number of gids 1
	root (0)
[root@hpl dat-files]# ll
total 1367600
-rw-r--r-- 1 root root 1398013952 Jun 25 08:58 hpl-21.4-ok.sqsh
-rw-r--r-- 1 root root       1132 Jun 24 13:18 HPL-dgx-a100-1N.dat
-rwxr-xr-x 1 root root       8555 Jun 24 13:18 hpl.sh
drwxr-xr-x 3 root root         48 Apr 15  2021 intel
drwxr-xr-x 7 root root         67 Apr 14  2021 openmpi
drwxr-xr-x 4 root root         32 Apr  8  2021 pmi
-rwxr-xr-x 1 root root    2391600 Jun 24 13:18 xhpl

[root@hpl dat-files]# CONT='hpl-21.4-ok.sqsh'
[root@hpl dat-files]# MOUNT="/root/hpl/dat-files:/my-dat-files"
[root@hpl dat-files]# srun -N 1 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" \
    --container-mounts="${MOUNT}" \
    hpl.sh --config dgx-a100 --cpu-affinity 0 --cpu-cores-per-rank 32 --gpu-affinity 0 --mem-affinity 0 --dat /my-dat-files/HPL-dgx-a100-1N.dat
pyxis: importing docker image: hpl-21.4-ok.sqsh
slurmstepd: error: pyxis: child 10805 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Authenticating with user:
slurmstepd: error: pyxis: [INFO] Authentication succeeded
slurmstepd: error: pyxis: [INFO] Fetching image manifest list
slurmstepd: error: pyxis: [INFO] Fetching image manifest
slurmstepd: error: pyxis: [ERROR] URL https://registry-1.docker.io/v2/library/hpl-21.4-ok.sqsh/manifests/latest returned error code: 401 Unauthorized
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: hpl: task 0: Exited with exit code 1

flx42 commented 2 years ago

You should use either CONT='nvcr.io/nvidia/hpc-benchmarks:21.4-hpl', or enroot import dockerd://hpl-21.4-ok followed by CONT='./hpl-21.4-ok.sqsh' (note the added `./`).
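The `./` matters because pyxis decides between a registry reference and a local file from the shape of the argument. As an illustration only (this is not pyxis's actual parser), a path-like prefix selects the local-file branch:

```shell
# Illustrative heuristic, not pyxis's real logic: arguments with a path-like
# prefix are treated as local squashfs files, everything else goes to a
# registry -- which is why 'hpl-21.4-ok.sqsh' alone was looked up on
# registry-1.docker.io while './hpl-21.4-ok.sqsh' is opened locally.
classify_image_arg() {
    case "$1" in
        /*|./*|../*) echo "local file" ;;
        *)           echo "registry reference" ;;
    esac
}

classify_image_arg 'hpl-21.4-ok.sqsh'    # -> registry reference
classify_image_arg './hpl-21.4-ok.sqsh'  # -> local file
```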

jieguolove commented 2 years ago

Thanks, but why does it still report errors?

[root@hpl enroot]# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
[root@hpl dat-files]# nvidia-smi
Sun Jun 26 15:52:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   79C    P0    59W / 250W |      0MiB / 40960MiB |      3%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@hpl hpl]# cd dat-files/
[root@hpl dat-files]# ll
total 1367600
-rw-r--r-- 1 root root 1398013952 Jun 25 08:58 hpl-21.4-ok.sqsh
-rw-r--r-- 1 root root       1132 Jun 24 13:18 HPL-dgx-a100-1N.dat
-rwxr-xr-x 1 root root       8555 Jun 24 13:18 hpl.sh
drwxr-xr-x 3 root root         48 Apr 15  2021 intel
drwxr-xr-x 7 root root         67 Apr 14  2021 openmpi
drwxr-xr-x 4 root root         32 Apr  8  2021 pmi
-rwxr-xr-x 1 root root    2391600 Jun 24 13:18 xhpl
[root@hpl dat-files]# CONT='./hpl-21.4-ok.sqsh'
[root@hpl dat-files]# MOUNT="/root/hpl/dat-files:/my-dat-files"
[root@hpl dat-files]# srun -N 1 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" \
    --container-mounts="${MOUNT}" \
    hpl.sh --config dgx-a100 --cpu-affinity 0 --cpu-cores-per-rank 32 --gpu-affinity 0 --mem-affinity 0 --dat /my-dat-files/HPL-dgx-a100-1N.dat
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [WARN] Kernel module nvidia_uvm is not loaded. Make sure the NVIDIA device driver is installed and loaded.
slurmstepd: error: pyxis: enroot-mount: failed to mount: tmpfs at /root/.local/share/enroot/pyxis_78.0/sys/class/infiniband: No such file or directory
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/99-mellanox.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: hpl: task 0: Exited with exit code 1
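A hedged troubleshooting sketch for the two enroot errors above (the fixes need root on the compute node; the checks are harmless anywhere):

```shell
# 1) "Kernel module nvidia_uvm is not loaded" -- load it and re-check:
#      modprobe nvidia_uvm
# 2) "/etc/enroot/hooks.d/99-mellanox.sh exited with return code 1" -- on a
#    node without InfiniBand, /sys/class/infiniband does not exist, so one
#    possible workaround is to disable that hook:
#      chmod -x /etc/enroot/hooks.d/99-mellanox.sh

# Checks that mirror what the two errors test for:
module_loaded() {
    grep -qw "$1" /proc/modules 2>/dev/null && echo loaded || echo missing
}
module_loaded nvidia_uvm
[ -d /sys/class/infiniband ] && echo "infiniband present" || echo "no infiniband"
```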

[root@hpl dat-files]# CONT='nvcr.io/nvidia/hpc-benchmarks:21.4-hpl'
[root@hpl dat-files]# MOUNT="/root/hpl/dat-files:/my-dat-files"
[root@hpl dat-files]# srun -N 1 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" \
    --container-mounts="${MOUNT}" \
    hpl.sh --config dgx-a100 --cpu-affinity 0 --cpu-cores-per-rank 32 --gpu-affinity 0 --mem-affinity 0 --dat /my-dat-files/HPL-dgx-a100-1N.dat
pyxis: importing docker image: nvcr.io/nvidia/hpc-benchmarks:21.4-hpl
slurmstepd: error: pyxis: child 27478 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Authenticating with user:
slurmstepd: error: pyxis: [INFO] Authentication succeeded
slurmstepd: error: pyxis: [INFO] Fetching image manifest list
slurmstepd: error: pyxis: [INFO] Fetching image manifest
slurmstepd: error: pyxis: [ERROR] URL https://registry-1.docker.io/v2/nvcr.io/nvidia/hpc-benchmarks/manifests/21.4-hpl returned error code: 401 Unauthorized
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: hpl: task 0: Exited with exit code 1
[root@hpl dat-files]# cd /root/.local/share/enroot/
[root@hpl enroot]# ll
total 0
drwxr-xr-x 23 root root 274 Jun 24 21:30 hpl-21.4-ok
[root@hpl enroot]# ll /etc/enroot/hooks.d/99-mellanox.sh
-rwxr-xr-x 1 root root 7998 Nov 13  2021 /etc/enroot/hooks.d/99-mellanox.sh
[root@hpl enroot]# CONT='./hpl-21.4-ok'
[root@hpl enroot]# MOUNT="/root/hpl/dat-files:/my-dat-files"
[root@hpl enroot]# srun -N 1 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" \
    --container-mounts="${MOUNT}" \
    hpl.sh --config dgx-a100 --cpu-affinity 0 --cpu-cores-per-rank 32 --gpu-affinity 0 --mem-affinity 0 --dat /my-dat-files/HPL-dgx-a100-1N.dat
slurmstepd: error: pyxis: child 27748 failed with error code: 1
slurmstepd: error: pyxis: failed to create container filesystem
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [ERROR] No such file or directory: /root/.local/share/enroot/hpl-21.4-ok
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: hpl: task 0: Exited with exit code 1
[root@hpl enroot]# ll /root/.local/share/enroot/hpl-21.4-ok
total 12
drwxr-xr-x  2 root root 4096 Apr 15  2021 bin
drwxr-xr-x  2 root root    6 Apr 24  2018 boot
drwxr-xr-x  2 root root    6 Jun 24 21:25 dev
drwxr-xr-x 41 root root 4096 Jun 24 21:30 etc
drwxr-xr-x  2 root root    6 Apr 24  2018 home
drwxr-xr-x  9 root root  112 Jun 23 09:11 lib
drwxr-xr-x  2 root root   34 Feb 23  2021 lib64
drwxr-xr-x  2 root root    6 Feb 23  2021 media
drwxr-xr-x  2 root root    6 Feb 23  2021 mnt
drwxr-xr-x  2 root root    6 Jun 23 09:11 my-dat-files
drwxr-xr-x  3 root root   19 Apr 15  2021 opt
drwxr-xr-x  2 root root    6 Apr 24  2018 proc
drwx------  2 root root   71 Jun 23 20:48 root
drwxr-xr-x  5 root root   58 Mar  4  2021 run
drwxr-xr-x  2 root root 4096 Mar  4  2021 sbin
drwxr-xr-x  2 root root    6 Feb 23  2021 srv
drwxr-xr-x  2 root root    6 Apr 24  2018 sys
drwxrwxrwt  2 root root    6 Jun 23 20:48 tmp
drwxr-xr-x 10 root root  105 Feb 23  2021 usr
drwxr-xr-x 11 root root  139 Feb 23  2021 var
drwxr-xr-x  4 root root  111 Jun 24 13:16 workspace

jieguolove commented 2 years ago

The last errors:

[root@hpl dat-files]# CONT='./hpl-21.4-ok.sqsh'
[root@hpl dat-files]# MOUNT="/root/hpl/dat-files:/my-dat-files"
[root@hpl dat-files]# srun -N 1 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" \
    --container-mounts="${MOUNT}" \
    hpl.sh --config dgx-a100 --cpu-affinity 0 --cpu-cores-per-rank 32 --gpu-affinity 0 --mem-affinity 0 --dat /my-dat-files/HPL-dgx-a100-1N.dat
INFO: host=hpl rank=0 lrank=0 cores=32 gpu=0 cpu=0 mem=0 net=lo bin=/workspace/hpl-linux-x86_64/xhpl
An error occurred in MPI_Init on a NULL communicator
MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, and potentially your MPI job)
[hpl:21881] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: hpl: task 0: Exited with exit code 1
[root@hpl dat-files]#

jieguolove commented 2 years ago


Thanks flx42!

I have solved it by adding the parameter --mpi=pmi2:

srun -N 1 --mpi=pmi2 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" --container-mounts="${MOUNT}" \
    hpl.sh --config dgx-a100 --cpu-affinity 0 --cpu-cores-per-rank 32 --gpu-affinity 0 --mem-affinity 0 --dat /my-dat-files/HPL-dgx-a100-1N.dat
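For context, `--mpi=pmi2` makes srun expose a PMI-2 interface that the Open MPI build inside the container can use during MPI_Init; `srun --mpi=list` shows which PMI types a given Slurm build supports. A dry-run sketch of the final command (variables as set earlier in the thread):

```shell
CONT='./hpl-21.4-ok.sqsh'
MOUNT='/root/hpl/dat-files:/my-dat-files'

# Dry run: print the final invocation instead of submitting it.
echo srun -N 1 --mpi=pmi2 --ntasks-per-node=1 --cpu-bind=none \
    --container-image="${CONT}" --container-mounts="${MOUNT}" \
    hpl.sh --config dgx-a100 --dat /my-dat-files/HPL-dgx-a100-1N.dat
```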