lijunsong opened 1 year ago
@lijunsong could you update to the latest NVIDIA Container Toolkit (1.13.4) and try again. I have just run the commands you supplied on the following:
$ nvidia-container-cli --version
cli-version: 1.13.4
lib-version: 1.13.4
build date: 2023-07-12T20:05+00:00
build revision: 31e068e7ab3e2294a379cbf11cc7a99281f41b66
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
and see read: Connection reset by peer after choosing no as the ssh option.
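The reasoning behind reading that error can be sketched as a small shell check (the captured string is the one quoted above; the classification labels are illustrative, not sshfs output):

```shell
# Sketch: sshfs opens /dev/fuse before doing any network I/O, so a
# network-stage error such as "read: Connection reset by peer" implies the
# fuse open itself succeeded. The captured string below is the one quoted
# in this thread.
sshfs_output="read: Connection reset by peer"
result="unknown result"
case "$sshfs_output" in
  *"Connection reset by peer"*) result="fuse open OK; failure was network-side" ;;
  *fuse*)                       result="fuse device problem" ;;
esac
echo "$result"
```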
Running ls -al /dev/fuse on the host and in the container may give us some insights as to what the differences are in your case:

$ ls -al /dev/fuse
crw-rw-rw- 1 root root 10, 229 May 23 19:04 /dev/fuse
On host:
$ stat /dev/fuse
File: /dev/fuse
Size: 0 Blocks: 0 IO Block: 4096 character special file
Device: 6h/6d Inode: 87 Links: 1 Device type: a,e5
Access: (0666/crw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-06-10 16:42:59.154345246 +0000
Modify: 2023-06-10 16:42:59.154345246 +0000
Change: 2023-06-10 16:42:59.154345246 +0000
Birth: -
In docker:
$ sudo docker run -e NVIDIA_VISIBLE_DEVICES=0 -e NVIDIA_VISIBLE_DEVICES=all --network host --privileged --cap-add SYS_ADMIN --security-opt apparmor:unconfined --runtime=nvidia --rm -it nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04 /bin/bash -c 'stat /dev/fuse'
==========
== CUDA ==
==========
CUDA Version 11.6.0
Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
File: /dev/fuse
Size: 0 Blocks: 0 IO Block: 4096 character special file
Device: 4000f4h/4194548d Inode: 571 Links: 1 Device type: a,e5
Access: (0666/crw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-07-14 17:38:04.441110907 +0000
Modify: 2023-07-14 17:38:04.441110907 +0000
Change: 2023-07-14 17:38:04.441110907 +0000
Birth: -
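The two stat outputs above can be cross-checked: the "Device type" field is the device's major,minor pair in hexadecimal, and decoding it shows it matches the "10, 229" printed by ls -al /dev/fuse earlier in the thread. A minimal sketch of that decoding:

```shell
# Sketch: decode the "Device type: a,e5" field from the stat outputs above.
# stat reports st_rdev major,minor in hex; a,e5 decodes to 10,229, matching
# ls -al /dev/fuse. The differing "Device:" values (6h on host vs 4000f4h in
# the container) are st_dev, i.e. the filesystem holding the node, so the
# device node itself is identical in both places.
devtype="a,e5"
major=$((16#${devtype%,*}))   # "a"  -> 10
minor=$((16#${devtype#*,}))   # "e5" -> 229
echo "major=$major minor=$minor"
```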
1. Issue or feature description

A privileged docker container can't perform a fuse mount when GPUs are enabled.

2. Steps to reproduce the issue

We can use a simple sshfs mount, which is known to work in docker, to test fuse. sshfs opens /dev/fuse before doing any network operations, so when we see read: Connection reset by peer, we know /dev/fuse works in docker.

Case 1: nvidia runtime + ubuntu20.04 image without enabling any GPUs: /dev/fuse OK.
Case 2: nvidia runtime + ubuntu20.04 image with GPUs enabled: /dev/fuse failed.
Case 3: nvidia runtime + official cuda image (regardless of whether GPUs are enabled): /dev/fuse failed.
Case 4: runc + official cuda image: /dev/fuse
OK.

3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
dmesg
nvidia-smi -a
docker version
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
nvidia-container-cli -V
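The four reproduction cases in section 2 can be sketched as a single script. This is only a sketch: the docker invocation is commented out because it needs a GPU host with the nvidia runtime configured, the flags mirror the maintainer's command earlier in the thread, and the NVIDIA_VISIBLE_DEVICES value "void" (documented as behaving like plain runc, i.e. no GPUs injected) stands in for "GPUs not enabled":

```shell
# Sketch of the four-case matrix from section 2. Each entry is
# "runtime|image|gpu". The docker command is commented out because it
# requires a GPU docker host with the nvidia runtime installed.
cases="nvidia|ubuntu:20.04|void
nvidia|ubuntu:20.04|all
nvidia|nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04|all
runc|nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04|void"

echo "$cases" | while IFS='|' read -r runtime image gpu; do
  echo "case: runtime=$runtime image=$image gpu=$gpu"
  # docker run --rm --privileged --cap-add SYS_ADMIN \
  #   --security-opt apparmor:unconfined --runtime="$runtime" \
  #   -e NVIDIA_VISIBLE_DEVICES="$gpu" "$image" \
  #   /bin/bash -c 'stat /dev/fuse'
done
```

Cases 2 and 3 are the ones expected to fail per the matrix above.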