NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.

/run/nvidia-persistenced/socket: no such device or address #138

Open matyro opened 2 years ago

matyro commented 2 years ago

Hi, I was referred here from https://github.com/NVIDIA/libnvidia-container/issues/187.

We are currently setting up a new cluster deployment with Slurm, pyxis and enroot. Our machines have DGX OS installed.

Container images without GPU support, e.g. CentOS:

srun --container-image=centos grep PRETTY /etc/os-release

finish without a problem. GPU-based images, e.g.:

srun --gpus=1 --container-image=nvcr.io/nvidia/tensorflow:22.08-tf2-py3 /bin/bash

run into a problem during startup:

slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     nvidia-container-cli: mount error: file creation failed: /raid/enroot-data/user-9011/pyxis_64.0/run/nvidia-persistenced/socket: no such device or address
slurmstepd: error: pyxis:     [ERROR] /raid/enroot//hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container

I think it is some misconfiguration, but so far I have not been able to spot it.
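Since the failing step is nvidia-container-cli (invoked from the 98-nvidia.sh hook), a first sanity check of libnvidia-container on the host could look like this (a sketch; these are standard libnvidia-container subcommands, not part of the original report):

# check that libnvidia-container can talk to the driver on the host
nvidia-container-cli info
# list the driver libraries it would inject into a container
nvidia-container-cli list --libraries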

enroot verify:

Linux version 5.4.0-117-generic (buildd@lcy02-amd64-006) (gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)) #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022

Kernel configuration:

CONFIG_NAMESPACES                 : OK
CONFIG_USER_NS                    : OK
CONFIG_SECCOMP_FILTER             : OK
CONFIG_OVERLAY_FS                 : OK (module)
CONFIG_X86_VSYSCALL_EMULATION     : OK
CONFIG_VSYSCALL_EMULATE           : KO (required if glibc <= 2.13)
CONFIG_VSYSCALL_NATIVE            : KO (required if glibc <= 2.13)

Kernel command line:

vsyscall=native                   : KO (required if glibc <= 2.13)
vsyscall=emulate                  : KO (required if glibc <= 2.13)

Kernel parameters:

kernel.unprivileged_userns_clone  : OK
user.max_user_namespaces          : OK
user.max_mnt_namespaces           : OK

Extra packages:

nvidia-container-cli              : OK

GLIBC Version:

ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31

/lib/x86_64-linux-gnu/libc.so.6 --version
GNU C Library (Ubuntu GLIBC 2.31-0ubuntu9.9) stable release version 2.31.
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 9.4.0.
libc ABIs: UNIQUE IFUNC ABSOLUTE
For bug reporting instructions, please see:
<https://bugs.launchpad.net/ubuntu/+source/glibc/+bugs>.
matyro commented 2 years ago

Manually starting containers on the node gives the same result for all GPU-based containers:

enroot import 'docker://nvcr.io/nvidia/cuda:10.2-devel-ubuntu18.04'

[INFO] Querying registry for permission grant
[INFO] Authenticating with user: $oauthtoken
[INFO] Using credentials from file: /raid/enroot//.credentials
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Downloading 10 missing layers...

100% 10:0=0s 1a5bbe4237bb13b91ce4674fd79dc748b428555f85c15232c1f6ca1493483418

[INFO] Extracting image layers...

100% 9:0=0s 726b8a513d66e3585eb57389171d97fcd348e4914a415891e1da135b85ffa6c3

[INFO] Converting whiteouts...

100% 9:0=0s 726b8a513d66e3585eb57389171d97fcd348e4914a415891e1da135b85ffa6c3

[INFO] Creating squashfs filesystem...

Parallel mksquashfs: Using 256 processors
Creating 4.0 filesystem on /tmp/nvidia+cuda+10.2-devel-ubuntu18.04.sqsh, block size 131072.
[=========================================================================================================================================================================================/] 31847/31847 100%

Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
        uncompressed data, uncompressed metadata, uncompressed fragments,
        uncompressed xattrs, uncompressed ids
        duplicates are not removed
Filesystem size 2981843.53 Kbytes (2911.96 Mbytes)
        99.99% of uncompressed filesystem size (2982265.77 Kbytes)
Inode table size 475654 bytes (464.51 Kbytes)
        100.00% of uncompressed inode table size (475654 bytes)
Directory table size 246175 bytes (240.41 Kbytes)
        100.00% of uncompressed directory table size (246175 bytes)
No duplicate files removed
Number of inodes 11509
Number of files 9393
Number of fragments 780
Number of symbolic links  762
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 1354
Number of ids (unique uids + gids) 1
Number of uids 1
        root (0)
Number of gids 1
        root (0)
enroot create --name cuda10.2-U18.04 nvidia+cuda+10.2-devel-ubuntu18.04.sqsh
[INFO] Extracting squashfs filesystem...

Parallel unsquashfs: Using 256 processors
10160 inodes (32609 blocks) to write

[=========================================================================================================================================================================================\] 32609/32609 100%

created 9393 files
created 1354 directories
created 762 symlinks
created 0 devices
created 0 fifos
enroot start --root --rw cuda10.2-U18.04 sh
nvidia-container-cli: mount error: file creation failed: /raid/enroot-data/user-9011/cuda10.2-U18.04/run/nvidia-persistenced/socket: no such device or address
[ERROR] /raid/enroot//hooks.d/98-nvidia.sh exited with return code 1
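Given the path in that error, a couple of host-side checks suggest themselves (a sketch; nvidia-persistenced is the standard NVIDIA persistence daemon, and the rootfs path is taken from the log above):

# is the persistence daemon running and exposing its socket on the host?
systemctl status nvidia-persistenced
ls -l /run/nvidia-persistenced/socket

# does the target path already exist inside the extracted container rootfs?
ls -ld /raid/enroot-data/user-9011/cuda10.2-U18.04/run/nvidia-persistenced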

Edit 10/04:

After reinstalling enroot from the package, I am now using the default config with one exception:

ENROOT_DATA_PATH=/raid/enroot-data/
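(For reference, in a packaged install this override would normally live in the enroot config file; a sketch, assuming the default sysconf path /etc/enroot/enroot.conf and its space-separated KEY VALUE format:)

# /etc/enroot/enroot.conf (excerpt)
ENROOT_DATA_PATH /raid/enroot-data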

The installed hooks are:

ls /etc/enroot/hooks.d/
10-cgroups.sh  10-devices.sh  98-nvidia.sh  99-mellanox.sh

I am getting a new error:

nvidia-container-cli: container error: stat failed: /raid/enroot-data/cuda_test/proc/2203322: no such file or directory
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
3XX0 commented 2 years ago

Both of these issues are with libnvidia-container, so I'm not sure why they redirected you here.

For /raid/enroot-data/user-9011/cuda10.2-U18.04/run/nvidia-persistenced/socket: no such device or address, it looks like /run has already been mounted, and libnvidia-container fails when it then tries to create the socket file there.
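If that is what's happening, one place to look (a sketch, assuming the default system-wide config path) is whether any enroot mount entry already covers /run:

# look for fstab entries that mount something over /run in the container
grep -rn '/run' /etc/enroot/mounts.d/ 2>/dev/null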

For /raid/enroot-data/cuda_test/proc/2203322: no such file or directory, I'm not sure about this one. Make sure /proc is correctly mounted in the container (use NVIDIA_VISIBLE_DEVICES=void to start the container without GPU support and check). Also check that your host procfs doesn't have hidepid or a similar option that could confuse libnvidia-container.
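A quick way to run that check (a sketch; the container name cuda_test is taken from the error above):

# start without GPU support and check that procfs is mounted inside;
# if /proc is missing in the container, this command simply fails
enroot start -e NVIDIA_VISIBLE_DEVICES=void cuda_test sh -c 'cat /proc/self/mounts'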

matyro commented 2 years ago

Hi, starting without GPU support does work:

enroot start -e NVIDIA_VISIBLE_DEVICES=void --root --rw cuda_root

We are using a DGX A100 node with a nearly stock DGX OS. As far as I know, and as far as I could find, hidepid is not set.
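One way to double-check that on the host (a sketch):

# hidepid would show up in the procfs mount options if it were set
findmnt -no OPTIONS /proc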