NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

/dev/fuse becomes unavailable when GPUs are enabled #80

Open lijunsong opened 1 year ago

lijunsong commented 1 year ago

1. Issue or feature description

A privileged Docker container cannot perform FUSE mounts when GPUs are enabled.

2. Steps to reproduce the issue

We can test FUSE with a simple sshfs mount, which is known to work in Docker. sshfs opens /dev/fuse before doing any network operations, so when we see read: Connection reset by peer, we know that /dev/fuse was opened successfully inside the container.
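The same check can be made without installing sshfs by trying to open the device node directly. The following is a minimal sketch, not part of the original report: check_dev_open is a hypothetical helper name, and it assumes that a refused open of /dev/fuse fails the same way the sshfs open does.

```shell
# Hypothetical helper: try to open a character device read-write and report the result.
check_dev_open() {
    dev="$1"
    # Refuse anything that is not a character device, so that the "<>"
    # redirection below can never create a regular file by accident.
    if [ ! -c "$dev" ]; then
        echo "not a character device: $dev"
        return 2
    fi
    # Open inside a subshell so the descriptor is closed again on exit.
    if ( exec 3<>"$dev" ) 2>/dev/null; then
        echo "open OK: $dev"
    else
        echo "open FAILED: $dev"
        return 1
    fi
}

# Run inside the container; "|| true" keeps a failure from aborting a script.
check_dev_open /dev/fuse || true
```

Running this in place of the apt-get/sshfs command in each case below should separate a device-access problem from an sshfs or network problem.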

Case 1: nvidia runtime + ubuntu20.04 image without enabling any gpus. /dev/fuse OK

$ docker run --network host --privileged --cap-add SYS_ADMIN --security-opt apparmor:unconfined --rm -it --runtime=nvidia ubuntu:20.04 /bin/bash -c 'apt-get update && apt-get install --no-install-recommends -y sshfs; mkdir /tmp/x; sshfs yo@127.0.0.1:~/ /tmp/x'
...
read: Connection reset by peer

Case 2: nvidia runtime + ubuntu20.04 image with GPUs enabled. /dev/fuse failed.

$ docker run -e NVIDIA_VISIBLE_DEVICES=0 -e NVIDIA_VISIBLE_DEVICES=all --network host --privileged --cap-add SYS_ADMIN --security-opt apparmor:unconfined --rm -it --runtime=nvidia ubuntu:20.04 /bin/bash -c 'apt-get update && apt-get install --no-install-recommends -y sshfs; mkdir /tmp/x; sshfs yo@127.0.0.1:~/ /tmp/x'
...
fuse: failed to open /dev/fuse: Operation not permitted

Case 3: nvidia runtime + official cuda image (regardless of whether GPUs are enabled). /dev/fuse failed

$ docker run -e NVIDIA_VISIBLE_DEVICES=0 -e NVIDIA_VISIBLE_DEVICES=all --network host --privileged --cap-add SYS_ADMIN --security-opt apparmor:unconfined --rm -it --runtime=nvidia nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04 /bin/bash -c 'apt-get update && apt-get install --no-install-recommends -y sshfs; mkdir /tmp/x; sshfs yo@127.0.0.1:~/ /tmp/x'
...
fuse: failed to open /dev/fuse: Operation not permitted

Case 4: runc + official cuda image. /dev/fuse OK

$ docker run -e NVIDIA_VISIBLE_DEVICES=0 -e NVIDIA_VISIBLE_DEVICES=all --network host --privileged --cap-add SYS_ADMIN --security-opt apparmor:unconfined --runtime=runc --rm -it nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04 /bin/bash -c 'apt-get update && apt-get install --no-install-recommends -y sshfs; mkdir /tmp/x; sshfs yo@127.0.0.1:~/ /tmp/x'
...
read: Connection reset by peer

3. Information to attach (optional if deemed irrelevant)

elezar commented 1 year ago

@lijunsong could you update to the latest NVIDIA Container Toolkit (1.13.4) and try again? I have just run the commands you supplied on the following:

$ nvidia-container-cli --version
cli-version: 1.13.4
lib-version: 1.13.4
build date: 2023-07-12T20:05+00:00
build revision: 31e068e7ab3e2294a379cbf11cc7a99281f41b66
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

and see read: Connection reset by peer after choosing no as the ssh option.
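To check whether an installed toolkit is at least 1.13.4, the cli-version line shown above can be compared against that number. This is only a sketch: version_ge is a hypothetical helper, and it assumes GNU sort with the -V (version sort) option is available.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on GNU `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extract "1.13.4" from a line such as "cli-version: 1.13.4".
installed="$(nvidia-container-cli --version 2>/dev/null | awk '/^cli-version:/ {print $2}')" || installed=""

# An empty result (tool missing) is treated as version 0.
if version_ge "${installed:-0}" 1.13.4; then
    echo "toolkit is at least 1.13.4"
else
    echo "toolkit is older than 1.13.4 or not installed"
fi
```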

elezar commented 1 year ago

Running

$ ls -al /dev/fuse
crw-rw-rw- 1 root root 10, 229 May 23 19:04 /dev/fuse

on the host and in the container may give us some insights as to what the differences are in your case.

lijunsong commented 1 year ago

On the host:

$ stat /dev/fuse
  File: /dev/fuse
  Size: 0           Blocks: 0          IO Block: 4096   character special file
Device: 6h/6d   Inode: 87          Links: 1     Device type: a,e5
Access: (0666/crw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-06-10 16:42:59.154345246 +0000
Modify: 2023-06-10 16:42:59.154345246 +0000
Change: 2023-06-10 16:42:59.154345246 +0000
 Birth: -

In the container:

$ sudo docker run -e NVIDIA_VISIBLE_DEVICES=0 -e NVIDIA_VISIBLE_DEVICES=all --network host --privileged --cap-add SYS_ADMIN --security-opt apparmor:unconfined --runtime=nvidia --rm -it nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04 /bin/bash -c 'stat /dev/fuse'

==========
== CUDA ==
==========

CUDA Version 11.6.0

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

  File: /dev/fuse
  Size: 0           Blocks: 0          IO Block: 4096   character special file
Device: 4000f4h/4194548d    Inode: 571         Links: 1     Device type: a,e5
Access: (0666/crw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-07-14 17:38:04.441110907 +0000
Modify: 2023-07-14 17:38:04.441110907 +0000
Change: 2023-07-14 17:38:04.441110907 +0000
 Birth: -
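Note that stat prints the device type in hexadecimal (a,e5), while ls prints the major and minor numbers in decimal (10, 229). A small conversion shows that they are the same device numbers. This is a sketch; hex_devtype_to_dec is a hypothetical helper name.

```shell
# Hypothetical helper: convert a hex "major,minor" pair (as shown in
# stat's "Device type" field) to decimal.
hex_devtype_to_dec() {
    major="${1%,*}"   # text before the comma
    minor="${1#*,}"   # text after the comma
    printf '%d,%d\n' "0x$major" "0x$minor"
}

hex_devtype_to_dec a,e5   # prints 10,229
```

So the host and the container report the same major/minor pair and the same 0666 mode for /dev/fuse; the two stat outputs differ only in the containing filesystem (the Device field), the inode, and the timestamps.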