NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

The "runtimes" setting in the Docker daemon config file for nvidia-docker on Ubuntu 20.04 breaks the docker service. #1420

Closed hongyi-zhao closed 3 years ago

hongyi-zhao commented 3 years ago

I'm on Ubuntu 20.04 and installed nvidia-docker following the installation guide, using this script:

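# Build the distribution string (e.g. ubuntu20.04) used in the repository URL.
ID=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
VERSION_ID=$(lsb_release -sr)
# Map 20.10 to the 20.04 repository.
if [[ "$VERSION_ID" == "20.10" ]]; then
  VERSION_ID=20.04
fi
distribution="$ID$VERSION_ID"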

curl -s -x socks5://127.0.0.1:18888 -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -x socks5://127.0.0.1:18888 -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list >/dev/null

sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

But the last command failed, and the docker service no longer works. After installing nvidia-docker2, /etc/docker/daemon.json has the following content:

$ cat /etc/docker/daemon.json
{
    "dns" : ["172.17.0.1"]
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

With these settings, the docker service fails to start:

$ sudo systemctl restart docker
Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xe" for details.

On the other hand, if I remove the "runtimes" setting from /etc/docker/daemon.json, as below, the docker service works again.

werner@X10DAi:~$ cat /etc/docker/daemon.json
{
    "dns" : ["172.17.0.1"]
}
werner@X10DAi:~$ sudo systemctl restart docker
werner@X10DAi:~$ docker info 
Client:
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-52-generic
 Operating System: Ubuntu 20.04 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 88
 Total Memory: 251.8GiB
 Name: X10DAi
 ID: X7KS:KLNJ:LXRV:4OB2:3QLW:FFTY:7KZQ:O4KT:MQPP:3O5P:MP2K:ZI6B
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http://172.17.0.1:8080/
 HTTPS Proxy: http://172.17.0.1:8080/
 No Proxy: localhost,127.0.0.1,packages.deepin.com,*.cn
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Furthermore, even without the "runtimes" section in /etc/docker/daemon.json, the base CUDA container test still succeeds, as shown below:

$ docker run --rm --gpus all nvidia/cuda nvidia-smi
Thu Nov 19 11:58:05 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 207...  On   | 00000000:02:00.0  On |                  N/A |
|  0%   36C    P8    19W / 215W |    261MiB /  7977MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Detailed NVIDIA driver and CUDA information is shown below:

$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Thu Nov 19 20:03:05 2020
Driver Version                            : 455.45.01
CUDA Version                              : 11.1

Attached GPUs                             : 1
GPU 00000000:02:00.0
    Product Name                          : GeForce RTX 2070 SUPER
    Product Brand                         : GeForce
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-64cfbbd9-dca8-6072-6943-e720c8bcd9bc
    Minor Number                          : 0
    VBIOS Version                         : 90.04.76.40.9D
    MultiGPU Board                        : No
    Board ID                              : 0x200
    GPU Part Number                       : N/A
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x02
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1E8410DE
        Bus Id                            : 00000000:02:00.0
        Sub System Id                     : 0x140A7377
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 15000 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 7977 MiB
        Used                              : 249 MiB
        Free                              : 7728 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 6 MiB
        Free                              : 250 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 1 %
        Memory                            : 3 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 36 C
        GPU Shutdown Temp                 : 95 C
        GPU Slowdown Temp                 : 92 C
        GPU Max Operating Temp            : 88 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 21.30 W
        Power Limit                       : 215.00 W
        Default Power Limit               : 215.00 W
        Enforced Power Limit              : 215.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 258.00 W
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1963
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 35 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2509
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 101 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 4227
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 92 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 4880
            Type                          : G
            Name                          : /opt/apps/com.baidu.fcitx-baidupinyin/files/bin/baidu-qimpanel
            Used GPU Memory               : 7 MiB

So, I want to know whether I really need to add the nvidia-docker "runtimes" setting to the Docker daemon config file, i.e., /etc/docker/daemon.json.

Any hints on this problem would be highly appreciated.

Regards, HY

klueska commented 3 years ago

You need a comma after the line:

    "dns" : ["172.17.0.1"]

klueska commented 3 years ago

Regarding:

Furthermore, even without the "runtimes" section in /etc/docker/daemon.json, the base CUDA container test still succeeds, as shown below:

Yes, if you run with the --gpus option, you don't actually need to install nvidia-docker2; nvidia-container-toolkit alone is enough. At this point, nvidia-docker2 is mostly necessary only if you plan to use it in a Kubernetes cluster (because there is no way to pass --gpus down to docker from within Kubernetes).
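
The two invocation styles can be sketched as follows. The first relies on Docker's built-in --gpus flag (the command from the original report); the second selects the nvidia runtime by name, which assumes a "runtimes" entry for nvidia exists in daemon.json. The function names are hypothetical, and nothing runs until one of them is called.

```shell
#!/usr/bin/env bash
# Sketch of two ways to start the same GPU test container.
# Function names are illustrative; defining them runs nothing.

# Needs only nvidia-container-toolkit: Docker's own --gpus flag requests the GPUs.
# The image name is the one used in the original report.
run_with_gpus_flag() {
    docker run --rm --gpus all nvidia/cuda nvidia-smi
}

# Needs the "nvidia" runtime registered in /etc/docker/daemon.json.
# NVIDIA_VISIBLE_DEVICES tells that runtime which GPUs to expose.
run_with_nvidia_runtime() {
    docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda nvidia-smi
}
```

Calling run_with_gpus_flag should work with the "runtimes" section removed, while run_with_nvidia_runtime would be expected to fail until that section is present and valid.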

hongyi-zhao commented 3 years ago

Thank you so much for highlighting my mistake and offering such thorough explanations.

paxdriver commented 8 months ago

Also, I believe "runtimes": { "nvidia": { "args": [], "path": "nvidia-container-runtime" } } is how you would pass args, not "runtimeArgs". I'm a newb, so maybe ignore me if both are acceptable values, but mine only had "args" :)

elezar commented 8 months ago

args is the correct entry. Note that nvidia-docker is deprecated and no longer installs / overwrites the daemon.json file. This should be configured manually after installing the nvidia-container-toolkit package(s) using the nvidia-ctk command.

See https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
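
As a rough sketch of that manual step: the installation guide linked above configures Docker with nvidia-ctk runtime configure and then restarts the daemon. The wrapper function names below are hypothetical, and has_nvidia_runtime is only an illustrative check that reads docker info output on stdin and looks for nvidia on the "Runtimes:" line.

```shell
#!/usr/bin/env bash
# Sketch: register the NVIDIA runtime with Docker via nvidia-ctk, then verify.
# Function names are illustrative; defining them changes nothing on the host.

configure_nvidia_runtime() {
    # nvidia-ctk edits /etc/docker/daemon.json; restart only if that succeeded.
    sudo nvidia-ctk runtime configure --runtime=docker &&
        sudo systemctl restart docker
}

# Reads `docker info` output on stdin; succeeds if the "Runtimes:" line lists nvidia.
has_nvidia_runtime() {
    grep -Eq '^[[:space:]]*Runtimes:.*nvidia'
}
```

A possible usage: `configure_nvidia_runtime && docker info | has_nvidia_runtime && echo "nvidia runtime registered"`.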