The "runtimes" settings in docker daemon config file for nvidia-docker on Ubuntu 20.04 will defeat the docker service. #1420

hongyi-zhao commented 3 years ago

I'm on Ubuntu 20.04, and I installed nvidia-docker following the installation guide. I wrote the following script for this job:

ID=$(lsb_release -si | tr '[A-Z]' '[a-z]')
VERSION_ID=$(lsb_release -sr)
if [[ $VERSION_ID == "20.10" ]]; then

curl -s -x socks5:// -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -x socks5:// -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list >/dev/null

sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

But the last command failed and, as a result, the docker service no longer works. After installing nvidia-docker2, I have the following content in /etc/docker/daemon.json:

$ cat /etc/docker/daemon.json
{
    "dns" : [""]
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

But with the above settings, the docker service fails to start:

$ sudo systemctl restart docker
Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xe" for details.
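One thing worth ruling out when the restart fails like this is a syntax error in /etc/docker/daemon.json. The following is a minimal sketch, assuming python3 is installed; the function name check_daemon_json is made up for illustration and is not part of Docker or the NVIDIA tooling.

```shell
# Hypothetical helper: report whether a daemon config file parses as JSON.
# Assumes python3 is on PATH; json.tool ships with its standard library.
check_daemon_json() {
    file="${1:-/etc/docker/daemon.json}"
    if [ ! -r "$file" ]; then
        echo "cannot read: $file"
        return 2
    fi
    if python3 -m json.tool "$file" >/dev/null 2>&1; then
        echo "valid JSON: $file"
    else
        echo "invalid JSON: $file"
        return 1
    fi
}
```

It could be used as `check_daemon_json && sudo systemctl restart docker`, so the restart is only attempted when the file parses. The daemon's own error message should also be visible with `journalctl -u docker.service`.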

On the other hand, if I remove the "runtimes" settings from /etc/docker/daemon.json, as below, the docker service works again.

werner@X10DAi:~$ cat /etc/docker/daemon.json
{
    "dns" : [""]
}
werner@X10DAi:~$ sudo systemctl restart docker
werner@X10DAi:~$ docker info 
 Debug Mode: false

 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
   Profile: default
 Kernel Version: 5.4.0-52-generic
 Operating System: Ubuntu 20.04 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 88
 Total Memory: 251.8GiB
 Name: X10DAi
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy:
 HTTPS Proxy:
 No Proxy: localhost,,packages.deepin.com,*.cn
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
 Live Restore Enabled: false

WARNING: No swap limit support

Furthermore, even if I don't set the "runtimes" section in /etc/docker/daemon.json, the base CUDA container test still succeeds, as shown below:

$ docker run --rm --gpus all nvidia/cuda nvidia-smi
Thu Nov 19 11:58:05 2020       
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 207...  On   | 00000000:02:00.0  On |                  N/A |
|  0%   36C    P8    19W / 215W |    261MiB /  7977MiB |      1%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |

The detailed NVIDIA driver and CUDA information is shown below:

So, I want to know whether I really should add the "runtimes" settings of nvidia-docker to the docker daemon config file, i.e., /etc/docker/daemon.json.

Any hints on this problem will be highly appreciated.

Regards, HY

klueska commented 3 years ago

You need a comma after the line:

    "dns" : [""]

klueska commented 3 years ago

Furthermore, even if I don't set the "runtimes" section in /etc/docker/daemon.json, the base CUDA container test still succeeds, as shown below:

Yes, if you run with the --gpus option, you don't actually need to install nvidia-docker2, just nvidia-container-toolkit. At this point, nvidia-docker2 is mostly necessary only if you plan on using it in a Kubernetes cluster (because there is no way to pass --gpus down to docker from within Kubernetes).
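To see which of the two setups a machine is actually in, one option is to check whether an nvidia runtime appears in the `docker info` output. This is a rough sketch, assuming the plain-text "Runtimes:" line seen earlier in this thread; has_runtime is a hypothetical helper name, not a docker subcommand.

```shell
# Hypothetical helper: read `docker info` text on stdin and succeed only if
# the named runtime is listed on the "Runtimes:" line.
has_runtime() {
    name="$1"
    awk -v want="$name" '
        $1 == "Runtimes:" {
            # Runtime names are space-separated after the label.
            for (i = 2; i <= NF; i++) if ($i == want) found = 1
        }
        END { exit !found }
    '
}
```

For example, `docker info | has_runtime nvidia || echo "no nvidia runtime registered"`. On the machine from this issue only runc is listed, and `docker run --gpus all` still works, which matches the explanation above.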

hongyi-zhao commented 3 years ago

Thank you so much for highlighting my mistake and offering such thorough explanations.

paxdriver commented 8 months ago

also, i believe "runtimes": { "nvidia": { "args": [], "path": "nvidia-container-runtime" } } is how you would pass args, not "runtimeArgs". I'm a newb so maybe ignore me if there are 2 acceptable values but mine only had "args" :)

elezar commented 8 months ago

args is the correct entry. Note that nvidia-docker is deprecated and no longer installs / overwrites the daemon.json file. This should be configured manually after installing the nvidia-container-toolkit package(s) using the nvidia-ctk command.

See https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
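As a rough sketch of the manual configuration mentioned above, assuming nvidia-container-toolkit is already installed and Docker is managed by systemd: the run wrapper, the DRY_RUN variable, and the function name configure_nvidia_runtime are illustrative additions and not part of the NVIDIA tooling.

```shell
# Illustrative wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Ask nvidia-ctk to add the nvidia runtime to Docker's daemon.json,
# then restart Docker only if that step succeeded.
configure_nvidia_runtime() {
    run sudo nvidia-ctk runtime configure --runtime=docker &&
        run sudo systemctl restart docker
}
```

Running `DRY_RUN=1 configure_nvidia_runtime` only prints the two commands, which makes it easy to review them before letting the tool touch /etc/docker/daemon.json.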