docker / compose

Define and run multi-container applications with Docker
https://docs.docker.com/compose/
Apache License 2.0
33.79k stars 5.2k forks

Nvidia Runtime Does Not Work via Docker Compose #12203

Open nomah98 opened 1 day ago

nomah98 commented 1 day ago

Description

After upgrading from docker-compose-plugin 2.29.1 to docker-compose-plugin/jammy 2.29.7, the runtime field of the docker compose file no longer enables the specified nvidia runtime in my container. However, running the same image with the argument --runtime nvidia does enable the Nvidia runtime in the container. I have other Nvidia devices running docker-compose-plugin 2.29.1 that do not have this issue.

Steps To Reproduce

On a Jetson Orin-NX with docker-compose-plugin/jammy 2.29.7, use docker compose to start a container via docker-compose that has fields such as

    image: MY_IMAGE
    container_name: MY_CONTAINER
    runtime: nvidia
    network_mode: host
    cap_add: [SYS_TIME]
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            capabilities:
              - utility # nvidia-smi
              - compute # CUDA
              - video   # NVDEC/NVENC/NVCUVID

then, inside the container, try to import something that uses an nvidia shared object (from .tensorrt import *) and see an error like: ImportError: /usr/lib/aarch64-linux-gnu/nvidia/libnvdla_compiler.so: file too short

Run the same image with docker run --runtime nvidia -it MY_IMAGE bash, then try the same import.

No error.
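As an aside, a "file too short" ImportError usually means the dynamic loader opened a zero-length file. On Jetson/L4T images, my understanding is that the NVIDIA libraries baked into the image are empty placeholder stubs that the nvidia runtime bind-mounts over with the host's real libraries at container start, so a quick check inside the container can confirm whether the runtime hook ran. This is a hypothetical diagnostic sketch (looks_mounted is not an existing tool; the path is taken from the ImportError above):

```python
import os

# Path from the ImportError in the reproduction above.
LIB = "/usr/lib/aarch64-linux-gnu/nvidia/libnvdla_compiler.so"

def looks_mounted(path):
    """True if the library exists and is not a zero-byte stub.

    A zero-byte file here suggests the nvidia runtime's bind mount
    never happened (assumption based on how L4T images are built).
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0

if __name__ == "__main__":
    print(LIB, "looks mounted" if looks_mounted(LIB) else "is a stub or missing")
```

Running this in the compose-started container versus the docker run container should show the stub in the broken case and the real library in the working one.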

Compose Version

docker-compose-plugin/jammy 2.29.7

Docker Environment

Client: Docker Engine - Community
 Version: 27.3.1
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version: v0.17.1
    Path: /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version: v2.29.7
    Path: /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.136-tegra
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: aarch64
 CPUs: 8
 Total Memory: 15.29GiB
 Name: rudi-nx
 ID: 3c5b7ecd-713f-4d6e-ac9a-f6cfe3c2112f
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  192.168.11.200:5000
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

Anything else?

Francisco encountered the same issue here

Like Francisco, I was able to make this work by downgrading docker-ce and docker-compose-plugin

ndeloof commented 21 hours ago

I don't have an nvidia device, so I can't try to reproduce, but using the same version, if I add a runtime: nvidia attribute to my compose file I get: Error response from daemon: unknown or invalid runtime name: nvidia, which seems to demonstrate the container is correctly configured to request the nvidia runtime.

Can you please capture docker inspect MY_CONTAINER for both compose versions running your application, so we can compare the container configuration and see what the newer compose version changes?
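Once both captures exist (e.g. saved as inspect-old.json and inspect-new.json), the relevant fields can be diffed mechanically. HostConfig.Runtime and HostConfig.DeviceRequests are real fields in docker inspect output; runtime_diff itself is just a hypothetical helper sketch:

```python
# Fields of HostConfig most relevant to GPU/runtime configuration.
FIELDS = ("Runtime", "DeviceRequests", "Privileged")

def runtime_diff(inspect_a, inspect_b):
    """Return {field: (value_a, value_b)} for HostConfig fields that differ.

    Each argument is the parsed JSON array that `docker inspect CONTAINER`
    prints (a list with one object per container).
    """
    host_a = inspect_a[0].get("HostConfig", {})
    host_b = inspect_b[0].get("HostConfig", {})
    return {
        f: (host_a.get(f), host_b.get(f))
        for f in FIELDS
        if host_a.get(f) != host_b.get(f)
    }

# Example with trimmed captures; only Runtime differs here.
old = [{"HostConfig": {"Runtime": "nvidia", "DeviceRequests": None}}]
new = [{"HostConfig": {"Runtime": "runc", "DeviceRequests": None}}]
print(runtime_diff(old, new))  # {'Runtime': ('nvidia', 'runc')}
```

If the compose-created container shows Runtime: runc (or an unexpected DeviceRequests value) under 2.29.7, that would confirm compose is no longer passing the runtime through.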

thaJeztah commented 20 hours ago

I think this may be related to a change contributed by NVIDIA.

thaJeztah commented 20 hours ago

I wonder, though, whether there's a difference here between how CLI options and compose options are handled, or whether the same issue happens on the CLI ("explicitly set to 0" vs "not set").
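The "explicitly set to 0" vs "not set" distinction can be made concrete at the API level. In the Engine API's DeviceRequest, a Count that is serialized as 0 is a different payload from a Count that is omitted entirely, and the two can be handled differently server-side. The field names below match the real DeviceRequest object; whether the daemon actually treats these two payloads differently is exactly the open question in this thread, so this is only an illustrative sketch:

```python
import json

# A DeviceRequest with Count explicitly zero...
explicit_zero = {"Driver": "nvidia", "Count": 0, "Capabilities": [["gpu"]]}
# ...versus one where Count is simply absent.
count_unset = {"Driver": "nvidia", "Capabilities": [["gpu"]]}

# The serialized payloads the daemon receives are not identical,
# so a client that emits 0 where it used to omit the field changes
# what the server sees.
print(json.dumps(explicit_zero, sort_keys=True))
print(json.dumps(count_unset, sort_keys=True))
```

A Go client with `omitempty` on the field would drop Count: 0 from the wire format, while one without it would send the zero explicitly, which is one plausible way a CLI/compose divergence could arise.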