Open matyro opened 2 years ago
Manual start attempts on the node give the same results with all GPU-based containers:
enroot import 'docker://nvcr.io/nvidia/cuda:10.2-devel-ubuntu18.04'
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: $oauthtoken
[INFO] Using credentials from file: /raid/enroot//.credentials
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Downloading 10 missing layers...
100% 10:0=0s 1a5bbe4237bb13b91ce4674fd79dc748b428555f85c15232c1f6ca1493483418
[INFO] Extracting image layers...
100% 9:0=0s 726b8a513d66e3585eb57389171d97fcd348e4914a415891e1da135b85ffa6c3
[INFO] Converting whiteouts...
100% 9:0=0s 726b8a513d66e3585eb57389171d97fcd348e4914a415891e1da135b85ffa6c3
[INFO] Creating squashfs filesystem...
Parallel mksquashfs: Using 256 processors
Creating 4.0 filesystem on /tmp/nvidia+cuda+10.2-devel-ubuntu18.04.sqsh, block size 131072.
[=========================================================================================================================================================================================/] 31847/31847 100%
Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
uncompressed data, uncompressed metadata, uncompressed fragments,
uncompressed xattrs, uncompressed ids
duplicates are not removed
Filesystem size 2981843.53 Kbytes (2911.96 Mbytes)
99.99% of uncompressed filesystem size (2982265.77 Kbytes)
Inode table size 475654 bytes (464.51 Kbytes)
100.00% of uncompressed inode table size (475654 bytes)
Directory table size 246175 bytes (240.41 Kbytes)
100.00% of uncompressed directory table size (246175 bytes)
No duplicate files removed
Number of inodes 11509
Number of files 9393
Number of fragments 780
Number of symbolic links 762
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 1354
Number of ids (unique uids + gids) 1
Number of uids 1
root (0)
Number of gids 1
root (0)
enroot create --name cuda10.2-U18.04 nvidia+cuda+10.2-devel-ubuntu18.04.sqsh
[INFO] Extracting squashfs filesystem...
Parallel unsquashfs: Using 256 processors
10160 inodes (32609 blocks) to write
[=========================================================================================================================================================================================\] 32609/32609 100%
created 9393 files
created 1354 directories
created 762 symlinks
created 0 devices
created 0 fifos
enroot start --root --rw cuda10.2-U18.04 sh
nvidia-container-cli: mount error: file creation failed: /raid/enroot-data/user-9011/cuda10.2-U18.04/run/nvidia-persistenced/socket: no such device or address
[ERROR] /raid/enroot//hooks.d/98-nvidia.sh exited with return code 1
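A first host-side check worth doing here (a rough sketch; the socket path is taken from the error above, and the nvidia-persistenced systemd unit name assumes the daemon DGX OS normally ships) is whether the persistence daemon is actually running and exposing its socket on the host:
systemctl status nvidia-persistenced
ls -la /run/nvidia-persistenced/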
Edit 10/04:
After reinstalling enroot from the package and using the default config, with the exception of
ENROOT_DATA_PATH=/raid/enroot-data/
and hooks:
ls /etc/enroot/hooks.d/
10-cgroups.sh 10-devices.sh 98-nvidia.sh 99-mellanox.sh
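For reference, a minimal sketch of that override (assuming it lives in /etc/enroot/enroot.conf, which takes whitespace-separated key/value pairs; only this line deviates from the defaults):
# /etc/enroot/enroot.conf (excerpt)
ENROOT_DATA_PATH /raid/enroot-data/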
I am getting a new error:
nvidia-container-cli: container error: stat failed: /raid/enroot-data/cuda_test/proc/2203322: no such file or directory
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
Both of these issues are with libnvidia-container, so I'm not sure why they redirected you here.
The first error,
/raid/enroot-data/user-9011/cuda10.2-U18.04/run/nvidia-persistenced/socket: no such device or address
looks like /run has already been mounted in the container and libnvidia-container fails when it tries to create the socket mount point there.
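One way to check that (a rough sketch, reusing the container name from the log above; setting NVIDIA_VISIBLE_DEVICES=void skips the GPU hook, as suggested below for the second error, so the container starts without libnvidia-container) is to inspect /run from inside the container:
enroot start -e NVIDIA_VISIBLE_DEVICES=void --root --rw cuda10.2-U18.04 findmnt /run
enroot start -e NVIDIA_VISIBLE_DEVICES=void --root --rw cuda10.2-U18.04 ls -la /run/nvidia-persistenced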
The second error,
/raid/enroot-data/cuda_test/proc/2203322: no such file or directory
I'm not sure about. Make sure /proc is correctly mounted in the container (use NVIDIA_VISIBLE_DEVICES=void to start the container without GPU support and check). Also check that your host procfs doesn't have hidepid or a similar option that could confuse libnvidia-container.
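To check the hidepid part (a rough sketch; a hidepid= setting would show up in the mount options of /proc on the host):
findmnt -o TARGET,FSTYPE,OPTIONS /proc
mount | grep ' /proc '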
Hi,
enroot start -e NVIDIA_VISIBLE_DEVICES=void --root --rw cuda_root
does work.
We are using a DGX A100 node with a nearly stock DGX OS. As far as I know, and as far as I could find, hidepid is not set.
Hi, I was referred here from https://github.com/NVIDIA/libnvidia-container/issues/187.
We are currently setting up a new cluster deployment environment with Slurm, Pyxis and enroot. Our machines have DGX OS installed.
Container images like centos
srun --container-image=centos grep PRETTY /etc/os-release
finish without a problem. GPU-based images like
srun --gpu=1 --container-image=nvcr.io/nvidia/tensorflow:22.08-tf2-py3 /bin/bash
run into a problem during startup:
I think it is some misconfiguration, but at the moment I am not able to spot it.
enroot verify:
GLIBC Version: