NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Docker unable to access GPU with the --gpus flag without sudo #175

Open szhang99-bu opened 7 months ago

szhang99-bu commented 7 months ago

1. Issue or feature description

I am currently trying to install a version of AlphaFold 2 on a desktop with an RTX 3090. While following the installation instructions, I ran into an issue: I cannot run Docker with the NVIDIA Container Toolkit without sudo. I have installed Docker Desktop and the NVIDIA Container Toolkit, and followed the steps to add my user to the docker group:

$ sudo groupadd docker
$ sudo usermod -aG docker $USER
$ newgrp docker 
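
To confirm that the group change is visible in the current shell, a check along these lines may help. This is a minimal sketch: the helper name `in_docker_group` is hypothetical, and it only inspects the group list printed by `id -nG`.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the space-separated group list
# passed as $1 contains an entry named exactly "docker".
in_docker_group() {
    for g in $1; do
        [ "$g" = "docker" ] && return 0
    done
    return 1
}

# id -nG prints the group names of the current process.
if in_docker_group "$(id -nG)"; then
    echo "current shell is in the docker group"
else
    echo "current shell is NOT in the docker group (log out and back in, or run newgrp docker)"
fi
```

Group membership only governs access to the system engine's socket, so a passing check here does not by itself say which engine the CLI is talking to.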

Docker can now run the hello-world verification step with no issue:

$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

However, when running Docker with the --gpus flag, sudo is required:

$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
$ sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Mon Dec  4 03:02:42 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:0A:00.0  On |                  N/A |
| 53%   28C    P8              52W / 390W |    580MiB / 24576MiB |      9%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

It seems other people have run into the same issue: https://github.com/google-deepmind/alphafold/issues/865#issue-2007089233
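
The error above says that nvidia-container-cli could not load libnvidia-ml.so.1. Since the sudo run succeeds, the library is presumably present on the host; the sketch below is one way to confirm that from the host linker cache. The helper name `has_nvml` is hypothetical, and the check assumes `ldconfig` is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: reads `ldconfig -p`-style text on stdin and
# succeeds if it lists libnvidia-ml.so.1.
has_nvml() {
    grep -q 'libnvidia-ml\.so\.1'
}

if command -v ldconfig >/dev/null 2>&1; then
    # ldconfig -p prints the libraries recorded in the linker cache.
    if ldconfig -p | has_nvml; then
        echo "libnvidia-ml.so.1 is in the host linker cache"
    else
        echo "libnvidia-ml.so.1 is NOT in the host linker cache"
    fi
fi
```

If the library is found on the host but the non-sudo run still fails, that would suggest the failing run is not executing in the same environment as the host driver installation.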

2. Steps to reproduce the issue

docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

3. Information to attach (optional if deemed irrelevant)

 ==============NVSMI LOG==============
Timestamp                                 : Sun Dec  3 22:12:31 2023
Driver Version                            : 535.129.03
CUDA Version                              : 12.2
Attached GPUs                             : 1
GPU 00000000:0A:00.0
    Product Name                          : NVIDIA GeForce RTX 3090
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-10a43dff-fecb-c8f6-905e-f0fbea173478
    Minor Number                          : 0
    VBIOS Version                         : 94.02.42.00.A9
    MultiGPU Board                        : No
    Board ID                              : 0xa00
    Board Part Number                     : N/A
    GPU Part Number                       : 2204-300-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x0A
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x220410DE
        Bus Id                            : 00000000:0A:00.0
        Sub System Id                     : 0x87AF1043
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 2000 KB/s
        Rx Throughput                     : 1000 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 53 %
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24576 MiB
        Reserved                          : 319 MiB
        Used                              : 567 MiB
        Free                              : 23689 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 8 MiB
        Free                              : 248 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 23 %
        Memory                            : 14 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 28 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 52.85 W
        Current Power Limit               : 390.00 W
        Requested Power Limit             : 390.00 W
        Default Power Limit               : 390.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 480.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 420 MHz
        SM                                : 420 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2130 MHz
        SM                                : 2130 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 750.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2511
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 278 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2646
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 75 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 4019
            Type                          : G
            Name                          : /opt/docker-desktop/Docker Desktop --type=gpu-process --enable-crash-reporter=5be23c03-174b-4789-a2c7-a9c63e76421d,no_channel --user-data-dir=/home/shiyuzhang/.config/Docker Desktop --gpu-preferences=WAAAAAAAAAAgAAAIAAAAAAAAAAAAAAAAAABgAAAAAAA4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAIAAAAAAAAAABAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAIAAAAAAAAAA== --shared-files --field-trial-handle=0,i,12713112242335749660,15840127836375193179,131072 --disable-features=SpareRendererForSitePerProcess
            Used GPU Memory               : 11 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 4782
            Type                          : G
            Name                          : /snap/firefox/2987/usr/lib/firefox/firefox
            Used GPU Memory               : 161 MiB
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                       Version                     Architecture Description
+++-==========================================-===========================-============-===============================================>
un  libgldispatch0-nvidia                      <none>                      <none>       (no description available)
ii  libnvidia-cfg1-535:amd64                   535.129.03-0ubuntu0.22.04.1 amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                         <none>                      <none>       (no description available)
un  libnvidia-common                           <none>                      <none>       (no description available)
ii  libnvidia-common-535                       535.129.03-0ubuntu0.22.04.1 all          Shared files used by the NVIDIA libraries
un  libnvidia-compute                          <none>                      <none>       (no description available)
un  libnvidia-compute-495                      <none>                      <none>       (no description available)
un  libnvidia-compute-495-server               <none>                      <none>       (no description available)
ii  libnvidia-compute-535:amd64                535.129.03-0ubuntu0.22.04.1 amd64        NVIDIA libcompute package
ii  libnvidia-compute-535:i386                 535.129.03-0ubuntu0.22.04.1 i386         NVIDIA libcompute package
ii  libnvidia-container-tools                  1.14.3-1                    amd64        NVIDIA container runtime library (command-line >
ii  libnvidia-container1:amd64                 1.14.3-1                    amd64        NVIDIA container runtime library
un  libnvidia-decode                           <none>                      <none>       (no description available)
ii  libnvidia-decode-535:amd64                 535.129.03-0ubuntu0.22.04.1 amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-535:i386                  535.129.03-0ubuntu0.22.04.1 i386         NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                           <none>                      <none>       (no description available)
ii  libnvidia-encode-535:amd64                 535.129.03-0ubuntu0.22.04.1 amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-535:i386                  535.129.03-0ubuntu0.22.04.1 i386         NVENC Video Encoding runtime library
un  libnvidia-extra                            <none>                      <none>       (no description available)
ii  libnvidia-extra-535:amd64                  535.129.03-0ubuntu0.22.04.1 amd64        Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                             <none>                      <none>       (no description available)
ii  libnvidia-fbc1-535:amd64                   535.129.03-0ubuntu0.22.04.1 amd64        NVIDIA OpenGL-based Framebuffer Capture runtime>
ii  libnvidia-fbc1-535:i386                    535.129.03-0ubuntu0.22.04.1 i386         NVIDIA OpenGL-based Framebuffer Capture runtime>
un  libnvidia-gl                               <none>                      <none>       (no description available)
ii  libnvidia-gl-535:amd64                     535.129.03-0ubuntu0.22.04.1 amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and >
ii  libnvidia-gl-535:i386                      535.129.03-0ubuntu0.22.04.1 i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and >
ii  libnvidia-ml-dev:amd64                     11.5.50~11.5.1-1ubuntu1     amd64        NVIDIA Management Library (NVML) development fi>
un  libnvidia-ml.so.1                          <none>                      <none>       (no description available)
ii  linux-modules-nvidia-535-6.2.0-37-generic  6.2.0-37.38~22.04.1         amd64        Linux kernel nvidia modules for version 6.2.0-37
ii  linux-modules-nvidia-535-generic-hwe-22.04 6.2.0-37.38~22.04.1         amd64        Extra drivers for nvidia-535 for the generic-hw>
ii  linux-objects-nvidia-535-6.2.0-37-generic  6.2.0-37.38~22.04.1         amd64        Linux kernel nvidia modules for version 6.2.0-3>
ii  linux-signatures-nvidia-6.2.0-37-generic   6.2.0-37.38~22.04.1         amd64        Linux kernel signatures for nvidia modules for >
un  nvidia-384                                 <none>                      <none>       (no description available)
un  nvidia-390                                 <none>                      <none>       (no description available)
un  nvidia-common                              <none>                      <none>       (no description available)
un  nvidia-compute-utils                       <none>                      <none>       (no description available)
ii  nvidia-compute-utils-535                   535.129.03-0ubuntu0.22.04.1 amd64        NVIDIA compute utilities
un  nvidia-container-runtime                   <none>                      <none>       (no description available)
un  nvidia-container-runtime-hook              <none>                      <none>       (no description available)
ii  nvidia-container-toolkit                   1.14.3-1                    amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base              1.14.3-1                    amd64        NVIDIA Container Toolkit Base
ii  nvidia-cuda-dev:amd64                      11.5.1-1ubuntu1             amd64        NVIDIA CUDA development files
un  nvidia-cuda-doc                            <none>                      <none>       (no description available)
ii  nvidia-cuda-gdb                            11.5.114~11.5.1-1ubuntu1    amd64        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit                        11.5.1-1ubuntu1             amd64        NVIDIA CUDA development toolkit
ii  nvidia-cuda-toolkit-doc                    11.5.1-1ubuntu1             all          NVIDIA CUDA and OpenCL documentation
un  nvidia-dkms-535                            <none>                      <none>       (no description available)
un  nvidia-docker                              <none>                      <none>       (no description available)
ii  nvidia-docker2                             2.14.0-1                    all          NVIDIA Container Toolkit meta-package
ii  nvidia-driver-535                          535.129.03-0ubuntu0.22.04.1 amd64        NVIDIA driver metapackage
un  nvidia-driver-binary                       <none>                      <none>       (no description available)
ii  nvidia-firmware-535-535.129.03             535.129.03-0ubuntu0.22.04.1 amd64        Firmware files used by the kernel module
un  nvidia-firmware-535-server-535.129.03      <none>                      <none>       (no description available)
un  nvidia-kernel-common                       <none>                      <none>       (no description available)
ii  nvidia-kernel-common-535                   535.129.03-0ubuntu0.22.04.1 amd64        Shared files used with the kernel module
un  nvidia-kernel-source                       <none>                      <none>       (no description available)
ii  nvidia-kernel-source-535                   535.129.03-0ubuntu0.22.04.1 amd64        NVIDIA kernel source package
un  nvidia-libopencl1                          <none>                      <none>       (no description available)
un  nvidia-libopencl1-dev                      <none>                      <none>       (no description available)
ii  nvidia-opencl-dev:amd64                    11.5.1-1ubuntu1             amd64        NVIDIA OpenCL development files
un  nvidia-opencl-icd                          <none>                      <none>       (no description available)
un  nvidia-persistenced                        <none>                      <none>       (no description available)
un  nvidia-prebuilt-kernel                     <none>                      <none>       (no description available)
ii  nvidia-prime                               0.8.17.1                    all          Tools to enable NVIDIA's Prime
ii  nvidia-profiler                            11.5.114~11.5.1-1ubuntu1    amd64        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                            510.47.03-0ubuntu1          amd64        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary                     <none>                      <none>       (no description available)
un  nvidia-smi                                 <none>                      <none>       (no description available)
un  nvidia-utils                               <none>                      <none>       (no description available)
ii  nvidia-utils-535                           535.129.03-0ubuntu0.22.04.1 amd64        NVIDIA driver support binaries
ii  nvidia-visual-profiler                     11.5.114~11.5.1-1ubuntu1    amd64        NVIDIA Visual Profiler for CUDA and OpenCL
ii  xserver-xorg-video-nvidia-535              535.129.03-0ubuntu0.22.04.1 amd64        NVIDIA binary Xorg driver

szhang99-bu commented 7 months ago

I have seen in another thread that this issue can be solved by editing /etc/nvidia-container-runtime/config.toml and changing:

[nvidia-container-cli]
no-cgroups = true

[nvidia-container-runtime]
debug = "/tmp/nvidia-container-runtime.log"

Is this still correct today? That thread is nearly 4 years old, and I do not have a file at "/tmp/nvidia-container-runtime.log".

This is the current content of /etc/nvidia-container-runtime/config.toml:

#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"
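
In the config above, the no-cgroups line is commented out, so the toolkit's default applies. A small read-only check such as the following can report the effective state without editing the file. This is a sketch under assumptions: the helper name `no_cgroups_state` is hypothetical, and it does a simple line match rather than real TOML parsing.

```shell
#!/bin/sh
# Hypothetical helper: prints the state of no-cgroups in the given
# config file: "true", "false", or "unset" (absent or commented out).
no_cgroups_state() {
    # Match only uncommented "no-cgroups = <value>" lines; keep the last one.
    value=$(sed -n 's/^[[:space:]]*no-cgroups[[:space:]]*=[[:space:]]*\([a-z]*\).*/\1/p' "$1" | tail -n 1)
    echo "${value:-unset}"
}

CONFIG=/etc/nvidia-container-runtime/config.toml
if [ -r "$CONFIG" ]; then
    echo "no-cgroups: $(no_cgroups_state "$CONFIG")"
fi
```

For the file shown above this reports "unset", because the only no-cgroups line starts with "#".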

elezar commented 7 months ago

@szhang99-bu for completeness, how is Docker installed? Is this Docker Desktop?

szhang99-bu commented 7 months ago

@elezar Docker was installed by following the Docker Desktop installation guide for Ubuntu on the Docker website, using the DEB package. The daemon has been configured to use the NVIDIA runtime in the settings:

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}

elezar commented 5 months ago

@szhang99-bu the toolkit currently only supports docker-ce and not Docker Desktop on Linux.
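
Given that distinction, it can be useful to see which engine the docker CLI actually reaches for the current user. The sketch below is one way to do that; the helper name `engine_kind` is hypothetical, and it assumes that Docker Desktop reports "Docker Desktop" in the OperatingSystem field of `docker info`.

```shell
#!/bin/sh
# Hypothetical helper: classifies an engine from the OperatingSystem
# string reported by `docker info`.
engine_kind() {
    case "$1" in
        *"Docker Desktop"*) echo "docker-desktop" ;;
        "")                 echo "unknown" ;;
        # Anything else, for example the host distribution name.
        *)                  echo "other" ;;
    esac
}

if command -v docker >/dev/null 2>&1; then
    # The current context is marked with "*" in this listing.
    docker context ls 2>/dev/null || true
    os=$(docker info --format '{{.OperatingSystem}}' 2>/dev/null) || os=""
    echo "engine: $(engine_kind "$os") ($os)"
fi
```

Running the same script with and without sudo may show whether the two invocations reach different engines. If the user's context points at Docker Desktop and docker-ce is also installed, `docker context use default` selects the system engine.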

itsklimov commented 2 months ago

Is this support coming?