NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Docker doesn't recognize nvidia as runtime after recent docker update #653

Closed: leonhartyao closed this issue 2 months ago

leonhartyao commented 2 months ago

Issue

Today, I can't run my docker container with --runtime="nvidia" anymore.

I have reinstalled nvidia-container-toolkit, and nvidia-container-runtime is in the PATH. I haven't touched the config file:

{
    "default-runtime": "nvidia",
    "insecure-registries": [
        "blabla"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/bin/nvidia-container-runtime"
        }
    }
}
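A quick way to rule out a malformed config is to parse it and confirm that an nvidia runtime entry with a path is present. The following is a minimal sketch, assuming the file's text has already been read into a string; the function name nvidia_runtime_configured is hypothetical and not part of any Docker or NVIDIA tool. It only checks the JSON, not whether the daemon actually loaded it.

```python
import json


def nvidia_runtime_configured(config_text: str) -> bool:
    """Return True if the JSON config registers an 'nvidia' runtime with a path.

    Hypothetical helper for illustration only: it inspects the config text,
    not the state of the running Docker daemon.
    """
    config = json.loads(config_text)
    runtime = config.get("runtimes", {}).get("nvidia")
    return isinstance(runtime, dict) and bool(runtime.get("path"))


# The config from this report, reproduced as a string.
SAMPLE_CONFIG = """
{
    "default-runtime": "nvidia",
    "insecure-registries": ["blabla"],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/bin/nvidia-container-runtime"
        }
    }
}
"""

if __name__ == "__main__":
    print(nvidia_runtime_configured(SAMPLE_CONFIG))
```

If this check passes but the daemon still does not list the runtime, the config itself is probably not the problem, which matches what is reported here.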

Even after restarting the docker service and rebooting, docker info lists only the runc runtimes, and specifying nvidia as the runtime fails with docker: Error response from daemon: unknown or invalid runtime name: nvidia.

docker info|grep -i runtime
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
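The same check can be scripted against the text that docker info prints. This is a rough sketch that assumes the two-line "Runtimes:" / "Default Runtime:" layout shown above; parse_runtimes is a hypothetical helper name, and the layout may differ between Docker versions.

```python
from typing import List, Optional, Tuple


def parse_runtimes(docker_info_text: str) -> Tuple[List[str], Optional[str]]:
    """Extract the registered runtimes and the default runtime.

    Assumes lines of the form 'Runtimes: a b c' and 'Default Runtime: x',
    as in the output pasted in this report.
    """
    runtimes: List[str] = []
    default: Optional[str] = None
    for line in docker_info_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Default Runtime:"):
            default = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("Runtimes:"):
            runtimes = stripped.split(":", 1)[1].split()
    return runtimes, default


# Output of `docker info|grep -i runtime` from this report.
SAMPLE_INFO = " Runtimes: io.containerd.runc.v2 runc\n Default Runtime: runc\n"

if __name__ == "__main__":
    runtimes, default = parse_runtimes(SAMPLE_INFO)
    print("nvidia registered:", "nvidia" in runtimes)
    print("default runtime:", default)
```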

This is most likely due to a recent apt update; I noticed that docker was updated.

How to reproduce

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Information

uname -a
Linux My-Thinkpad 6.5.0-1027-oem #28-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 25 13:32:46 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Thu Aug 15 12:05:55 2024
Driver Version                            : 535.183.01
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA RTX 2000 Ada Generation Laptop GPU
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ada Lovelace
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-eb3986bd-d914-e30b-f679-569151640211
    Minor Number                          : 0
    VBIOS Version                         : 95.07.28.00.6A
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : N/A
    GPU Part Number                       : 28B8-975-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G002.0000.00.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x28B810DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x231617AA
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 5
            Link Width
                Max                       : 8x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 8188 MiB
        Reserved                          : 247 MiB
        Used                              : 52 MiB
        Free                              : 7887 MiB
    BAR1 Memory Usage
        Total                             : 8192 MiB
        Used                              : 51 MiB
        Free                              : 8141 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 3 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 64 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 49 C
        GPU T.Limit Temp                  : 37 C
        GPU Shutdown T.Limit Temp         : -12 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 87 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A
    GPU Power Readings
        Power Draw                        : 4.93 W
        Current Power Limit               : 35.00 W
        Requested Power Limit             : 35.00 W
        Default Power Limit               : 35.00 W
        Min Power Limit                   : 5.00 W
        Max Power Limit                   : 65.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 1455 MHz
        Memory                            : 8001 MHz
    Default Applications Clocks
        Graphics                          : 1455 MHz
        Memory                            : 8001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 3105 MHz
        SM                                : 3105 MHz
        Memory                            : 8001 MHz
        Video                             : 2415 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 625.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2949
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 2 MiB
docker version
Client: Docker Engine - Community
 Version:           27.1.2
 API version:       1.43 (downgraded from 1.46)
 Go version:        go1.21.13
 Git commit:        d01f264
 Built:             Mon Aug 12 11:50:12 2024
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          24.0.5
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.8
  Git commit:       a61e2b4
  Built:            Sat Oct  7 00:14:30 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 runc:
  Version:          1.1.7
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                       Version                     Architecture Description
+++-==========================================-===========================-============-=======================================================================
un  libgldispatch0-nvidia                      <none>                      <none>       (no description available)
ii  libnvidia-cfg1-535:amd64                   535.183.01-0ubuntu0.22.04.1 amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                         <none>                      <none>       (no description available)
un  libnvidia-common                           <none>                      <none>       (no description available)
ii  libnvidia-common-535                       535.183.01-0ubuntu0.22.04.1 all          Shared files used by the NVIDIA libraries
un  libnvidia-compute                          <none>                      <none>       (no description available)
ii  libnvidia-compute-535:amd64                535.183.01-0ubuntu0.22.04.1 amd64        NVIDIA libcompute package
ii  libnvidia-compute-535:i386                 535.183.01-0ubuntu0.22.04.1 i386         NVIDIA libcompute package
ii  libnvidia-container-tools                  1.16.1-1                    amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                 1.16.1-1                    amd64        NVIDIA container runtime library
un  libnvidia-decode                           <none>                      <none>       (no description available)
ii  libnvidia-decode-535:amd64                 535.183.01-0ubuntu0.22.04.1 amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-535:i386                  535.183.01-0ubuntu0.22.04.1 i386         NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                           <none>                      <none>       (no description available)
ii  libnvidia-encode-535:amd64                 535.183.01-0ubuntu0.22.04.1 amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-535:i386                  535.183.01-0ubuntu0.22.04.1 i386         NVENC Video Encoding runtime library
un  libnvidia-encode1                          <none>                      <none>       (no description available)
un  libnvidia-extra                            <none>                      <none>       (no description available)
ii  libnvidia-extra-535:amd64                  535.183.01-0ubuntu0.22.04.1 amd64        Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                             <none>                      <none>       (no description available)
ii  libnvidia-fbc1-535:amd64                   535.183.01-0ubuntu0.22.04.1 amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-535:i386                    535.183.01-0ubuntu0.22.04.1 i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                               <none>                      <none>       (no description available)
ii  libnvidia-gl-535:amd64                     535.183.01-0ubuntu0.22.04.1 amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-535:i386                      535.183.01-0ubuntu0.22.04.1 i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ml.so.1                          <none>                      <none>       (no description available)
rc  linux-modules-nvidia-535-6.5.0-1023-oem    6.5.0-1023.24               amd64        Linux kernel nvidia modules for version 6.5.0-1023
rc  linux-modules-nvidia-535-6.5.0-1024-oem    6.5.0-1024.25+1             amd64        Linux kernel nvidia modules for version 6.5.0-1024
rc  linux-modules-nvidia-535-6.5.0-1025-oem    6.5.0-1025.26+1             amd64        Linux kernel nvidia modules for version 6.5.0-1025
ii  linux-modules-nvidia-535-6.5.0-1027-oem    6.5.0-1027.28               amd64        Linux kernel nvidia modules for version 6.5.0-1027
ii  linux-modules-nvidia-535-6.8.0-40-generic  6.8.0-40.40~22.04.3         amd64        Linux kernel nvidia modules for version 6.8.0-40
ii  linux-modules-nvidia-535-generic-hwe-22.04 6.8.0-40.40~22.04.3         amd64        Extra drivers for nvidia-535 for the generic-hwe-22.04 flavour
ii  linux-modules-nvidia-535-oem-22.04         6.8.0-40.40~22.04.3         amd64        Extra drivers for nvidia-535-oem-22.04 (dummy transitional package)
rc  linux-objects-nvidia-535-6.5.0-1023-oem    6.5.0-1023.24               amd64        Linux kernel nvidia modules for version 6.5.0-1023 (objects)
rc  linux-objects-nvidia-535-6.5.0-1024-oem    6.5.0-1024.25+1             amd64        Linux kernel nvidia modules for version 6.5.0-1024 (objects)
rc  linux-objects-nvidia-535-6.5.0-1025-oem    6.5.0-1025.26+1             amd64        Linux kernel nvidia modules for version 6.5.0-1025 (objects)
ii  linux-objects-nvidia-535-6.5.0-1027-oem    6.5.0-1027.28               amd64        Linux kernel nvidia modules for version 6.5.0-1027 (objects)
ii  linux-objects-nvidia-535-6.8.0-40-generic  6.8.0-40.40~22.04.3         amd64        Linux kernel nvidia modules for version 6.8.0-40 (objects)
un  linux-signatures-nvidia-6.5.0-1023-oem     <none>                      <none>       (no description available)
un  linux-signatures-nvidia-6.5.0-1024-oem     <none>                      <none>       (no description available)
un  linux-signatures-nvidia-6.5.0-1025-oem     <none>                      <none>       (no description available)
ii  linux-signatures-nvidia-6.5.0-1027-oem     6.5.0-1027.28               amd64        Linux kernel signatures for nvidia modules for version 6.5.0-1027-oem
ii  linux-signatures-nvidia-6.8.0-40-generic   6.8.0-40.40~22.04.3         amd64        Linux kernel signatures for nvidia modules for version 6.8.0-40-generic
un  nvidia-384                                 <none>                      <none>       (no description available)
un  nvidia-390                                 <none>                      <none>       (no description available)
un  nvidia-common                              <none>                      <none>       (no description available)
un  nvidia-compute-utils                       <none>                      <none>       (no description available)
ii  nvidia-compute-utils-535                   535.183.01-0ubuntu0.22.04.1 amd64        NVIDIA compute utilities
ii  nvidia-container-runtime                   3.14.0-1                    all          NVIDIA Container Toolkit meta-package
un  nvidia-container-runtime-hook              <none>                      <none>       (no description available)
ii  nvidia-container-toolkit                   1.16.1-1                    amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base              1.16.1-1                    amd64        NVIDIA Container Toolkit Base
un  nvidia-dkms-535                            <none>                      <none>       (no description available)
ii  nvidia-driver-535                          535.183.01-0ubuntu0.22.04.1 amd64        NVIDIA driver metapackage
un  nvidia-driver-binary                       <none>                      <none>       (no description available)
ii  nvidia-firmware-535-535.183.01             535.183.01-0ubuntu0.22.04.1 amd64        Firmware files used by the kernel module
un  nvidia-firmware-535-server-535.183.01      <none>                      <none>       (no description available)
un  nvidia-kernel-common                       <none>                      <none>       (no description available)
ii  nvidia-kernel-common-535                   535.183.01-0ubuntu0.22.04.1 amd64        Shared files used with the kernel module
un  nvidia-kernel-source                       <none>                      <none>       (no description available)
ii  nvidia-kernel-source-535                   535.183.01-0ubuntu0.22.04.1 amd64        NVIDIA kernel source package
un  nvidia-libopencl1-dev                      <none>                      <none>       (no description available)
un  nvidia-opencl-icd                          <none>                      <none>       (no description available)
un  nvidia-persistenced                        <none>                      <none>       (no description available)
un  nvidia-prebuilt-kernel                     <none>                      <none>       (no description available)
ii  nvidia-prime                               0.8.17.1                    all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                            510.47.03-0ubuntu1          amd64        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary                     <none>                      <none>       (no description available)
un  nvidia-smi                                 <none>                      <none>       (no description available)
un  nvidia-utils                               <none>                      <none>       (no description available)
ii  nvidia-utils-535                           535.183.01-0ubuntu0.22.04.1 amd64        NVIDIA driver support binaries
ii  xserver-xorg-video-nvidia-535              535.183.01-0ubuntu0.22.04.1 amd64        NVIDIA binary Xorg driver
nvidia-container-cli -V
cli-version: 1.16.1
lib-version: 1.16.1
build date: 2024-07-23T14:57+00:00
build revision: 4c2494f16573b585788a42e9c7bee76ecd48c73d
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
leonhartyao commented 2 months ago

I pinned the docker version with

sudo apt-get install docker-ce=5:27.1.1-1~ubuntu.22.04~jammy
sudo systemctl restart docker

And it works again. It seems that the client and server versions did not match before (client 27.1.2 vs. server 24.0.5).

Current:

docker version
Client: Docker Engine - Community
 Version:           27.1.2
 API version:       1.46
 Go version:        go1.21.13
 Git commit:        d01f264
 Built:             Mon Aug 12 11:50:12 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.1.1
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.12
  Git commit:       cc13f95
  Built:            Tue Jul 23 19:57:01 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.20
  GitCommit:        8fc6bcff51318944179630522a095cc9dbf9f353
 nvidia:
  Version:          1.1.13
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
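The client/server gap noted in this comment can be spotted automatically by comparing major versions in the plain-text docker version output. The sketch below assumes the "Client:" / "Server:" layout shown in this thread; parse_versions and major_mismatch are hypothetical helper names. Whether a major-version gap is the actual root cause remains the poster's hypothesis, so treat a mismatch only as a hint.

```python
import re
from typing import Dict


def parse_versions(docker_version_text: str) -> Dict[str, str]:
    """Return the first 'Version:' value found under 'Client:' and 'Server:'."""
    versions: Dict[str, str] = {}
    section = None
    for line in docker_version_text.splitlines():
        if line.startswith("Client:"):
            section = "client"
        elif line.startswith("Server:"):
            section = "server"
        else:
            # Matches ' Version: 27.1.2' but not ' API version: 1.46'.
            match = re.match(r"\s*Version:\s+(\S+)", line)
            if match and section and section not in versions:
                versions[section] = match.group(1)
    return versions


def major_mismatch(versions: Dict[str, str]) -> bool:
    """True if the client and server major version numbers differ."""
    client_major = versions["client"].split(".")[0]
    server_major = versions["server"].split(".")[0]
    return client_major != server_major


# Abbreviated from the two `docker version` outputs in this thread.
BEFORE = """\
Client: Docker Engine - Community
 Version:           27.1.2
 API version:       1.43 (downgraded from 1.46)

Server:
 Engine:
  Version:          24.0.5
  API version:      1.43 (minimum version 1.12)
"""

AFTER = """\
Client: Docker Engine - Community
 Version:           27.1.2
 API version:       1.46

Server: Docker Engine - Community
 Engine:
  Version:          27.1.1
  API version:      1.46 (minimum version 1.24)
"""

if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after", AFTER)):
        found = parse_versions(text)
        print(label, found, "major mismatch:", major_mismatch(found))
```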