lijunsong opened 1 year ago
@lijunsong could you update to the latest NVIDIA Container Toolkit (1.13.4) and try again. I have just run the commands you supplied on the following:
$ nvidia-container-cli --version
cli-version: 1.13.4
lib-version: 1.13.4
build date: 2023-07-12T20:05+00:00
build revision: 31e068e7ab3e2294a379cbf11cc7a99281f41b66
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
and see read: Connection reset by peer after choosing no as the ssh option.
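The reasoning behind reading that error can be sketched as a small shell check (the captured string is the one quoted above; the classification labels are illustrative, not sshfs output):

```shell
# Sketch: sshfs opens /dev/fuse before doing any network I/O, so a
# network-stage error such as "read: Connection reset by peer" implies the
# fuse open itself succeeded. The captured string below is the one quoted
# in this thread.
sshfs_output="read: Connection reset by peer"
result="unknown result"
case "$sshfs_output" in
  *"Connection reset by peer"*) result="fuse open OK; failure was network-side" ;;
  *fuse*)                       result="fuse device problem" ;;
esac
echo "$result"
```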
Running ls -al /dev/fuse on the host and in the container may give us some insights as to what the differences are in your case:

$ ls -al /dev/fuse
crw-rw-rw- 1 root root 10, 229 May 23 19:04 /dev/fuse
On host:
$ stat /dev/fuse
File: /dev/fuse
Size: 0 Blocks: 0 IO Block: 4096 character special file
Device: 6h/6d Inode: 87 Links: 1 Device type: a,e5
Access: (0666/crw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-06-10 16:42:59.154345246 +0000
Modify: 2023-06-10 16:42:59.154345246 +0000
Change: 2023-06-10 16:42:59.154345246 +0000
Birth: -
In docker:
$ sudo docker run -e NVIDIA_VISIBLE_DEVICES=0 -e NVIDIA_VISIBLE_DEVICES=all --network host --privileged --cap-add SYS_ADMIN --security-opt apparmor:unconfined --runtime=nvidia --rm -it nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04 /bin/bash -c 'stat /dev/fuse'
==========
== CUDA ==
==========
CUDA Version 11.6.0
Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
File: /dev/fuse
Size: 0 Blocks: 0 IO Block: 4096 character special file
Device: 4000f4h/4194548d Inode: 571 Links: 1 Device type: a,e5
Access: (0666/crw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-07-14 17:38:04.441110907 +0000
Modify: 2023-07-14 17:38:04.441110907 +0000
Change: 2023-07-14 17:38:04.441110907 +0000
Birth: -
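The two stat outputs above can be cross-checked: the "Device type" field is the device's major,minor pair in hexadecimal, and decoding it shows it matches the "10, 229" printed by ls -al /dev/fuse earlier in the thread. A minimal sketch of that decoding:

```shell
# Sketch: decode the "Device type: a,e5" field from the stat outputs above.
# stat reports st_rdev major,minor in hex; a,e5 decodes to 10,229, matching
# ls -al /dev/fuse. The differing "Device:" values (6h on host vs 4000f4h in
# the container) are st_dev, i.e. the filesystem holding the node, so the
# device node itself is identical in both places.
devtype="a,e5"
major=$((16#${devtype%,*}))   # "a"  -> 10
minor=$((16#${devtype#*,}))   # "e5" -> 229
echo "major=$major minor=$minor"
```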
1. Issue or feature description

A privileged docker container can't perform a fuse mount when GPUs are enabled.

2. Steps to reproduce the issue

We can use a simple sshfs mount, which is known to work in docker, to test fuse. sshfs opens /dev/fuse before doing any network operations, so when we see read: Connection reset by peer, we know /dev/fuse works in docker.

Case 1: nvidia runtime + ubuntu20.04 image without enabling any GPUs: /dev/fuse OK.
Case 2: nvidia runtime + ubuntu20.04 image with GPUs enabled: /dev/fuse failed.
Case 3: nvidia runtime + official cuda image (regardless of whether GPUs are enabled): /dev/fuse failed.
Case 4: runc + official cuda image: /dev/fuse
OK.

3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
dmesg
nvidia-smi -a
docker version
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
nvidia-container-cli -V
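The four reproduction cases in section 2 can be sketched as a single script. This is only a sketch: the docker invocation is commented out because it needs a GPU host with the nvidia runtime configured, the flags mirror the maintainer's command earlier in the thread, and the NVIDIA_VISIBLE_DEVICES value "void" (documented as behaving like plain runc, i.e. no GPUs injected) stands in for "GPUs not enabled":

```shell
# Sketch of the four-case matrix from section 2. Each entry is
# "runtime|image|gpu". The docker command is commented out because it
# requires a GPU docker host with the nvidia runtime installed.
cases="nvidia|ubuntu:20.04|void
nvidia|ubuntu:20.04|all
nvidia|nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04|all
runc|nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04|void"

echo "$cases" | while IFS='|' read -r runtime image gpu; do
  echo "case: runtime=$runtime image=$image gpu=$gpu"
  # docker run --rm --privileged --cap-add SYS_ADMIN \
  #   --security-opt apparmor:unconfined --runtime="$runtime" \
  #   -e NVIDIA_VISIBLE_DEVICES="$gpu" "$image" \
  #   /bin/bash -c 'stat /dev/fuse'
done
```

Cases 2 and 3 are the ones expected to fail per the matrix above.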