2b-t / docker-for-robotics

Collection of best practices when working with Docker/Docker-Compose and the Robot Operating System (ROS/ROS 2) in simulation as well as with real-hardware with real-time requirements
MIT License

[enhancement] nvidia-container-runtime is deprecated - tried to go for nvidia-container-toolkit but got error #1

LeroyOP0 closed this issue 6 months ago

LeroyOP0 commented 6 months ago

nvidia-container-runtime works well when set up exactly as described in the guide.

But it is deprecated, and users are advised to switch to nvidia-container-toolkit: [https://github.com/NVIDIA/nvidia-container-toolkit?tab=readme-ov-file]

I tried simply changing the runtime path in daemon.json:

sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-toolkit",
            "runtimeArgs": []
        }
    }
}
EOF

but got an error when using "docker compose ... up":

Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/e5a45f0f12e0c02ec1708dc89576a0972ab477a5c45a5db8cd6a312605b11084/log.json: no such file or directory): /usr/bin/nvidia-container-toolkit did not terminate successfully: exit status 2: flag provided but not defined: -root
Usage of /usr/bin/nvidia-container-toolkit:
  -config string
        configuration file
  -debug
        enable debug output
  -version
        enable version output

Commands:
  prestart
        run the prestart hook
  poststart
        no-op
  poststop
        no-op
: unknown

2b-t commented 6 months ago

Hi @LeroyOP0, did I understand correctly that you first installed nvidia-container-runtime, then uninstalled it and installed nvidia-container-toolkit instead? If so, what is the output of $ cat /etc/docker/daemon.json now? Did you reboot or restart the Docker daemon after upgrading? I have not installed nvidia-container-toolkit myself yet, but some of our PhD students did so recently and, as far as I remember, they did not need to add anything to /etc/docker/daemon.json. Could you remove the contents of that file, reboot the system, give it another try and send me the output of $ docker info?

LeroyOP0 commented 6 months ago

Collecting the info and getting back to you. Thanks champ

LeroyOP0 commented 6 months ago

For:

cat /etc/docker/daemon.json

I get:

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

and everything works perfectly, as the output of docker info shows:

Client: Docker Engine - Community
 Version:    25.0.3
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.12.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.24.5
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 0
  Paused: 0
  Stopped: 1
 Images: 12
 Server Version: 25.0.3
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.15.0-97-generic
 Operating System: Ubuntu 20.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 13.5GiB
 Name: XXXXX
 ID: e5c9c593-e6a4-4fe7-ae2b-d840d9af1af6
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

But the package is deprecated and I want to switch to the recommended nvidia-container-toolkit. What steps should I follow so that the nvidia runtime is provided by that package instead?

According to my installed packages (apt list --installed | grep nvidia-container), both are installed:

libnvidia-container-tools/bionic,now 1.13.5-1 amd64 [installed,automatic]
libnvidia-container1/bionic,now 1.13.5-1 amd64 [installed,automatic]
nvidia-container-runtime/bionic,now 3.13.0-1 all [installed]
nvidia-container-toolkit-base/bionic,now 1.13.5-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,now 1.13.5-1 amd64 [installed,automatic]

Should I simply uninstall "nvidia-container-runtime"?

2b-t commented 6 months ago

I would uninstall nvidia-container-runtime and delete the contents of /etc/docker/daemon.json, then restart the system and check the output of $ docker info to make sure that it still lists nvidia under Runtimes. If that does not work, uninstall both packages and reinstall only nvidia-container-toolkit. Let me know if that works...
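
The sequence suggested above could be scripted roughly as follows. This is only a sketch, assuming an apt-based system with systemd. The function name reset_nvidia_runtime and the backup file name are made up for illustration, and writing {} (an empty JSON object) is an assumption about what "deleting the contents" should leave behind. It restarts the Docker daemon instead of rebooting the whole system.

```shell
# Hypothetical helper; nothing is changed until the function is called.
reset_nvidia_runtime() {
    # Remove the deprecated package
    sudo apt remove -y nvidia-container-runtime || return 1
    # Keep a backup of the old Docker daemon configuration, if there is one
    if [ -f /etc/docker/daemon.json ]; then
        sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.bak || return 1
    fi
    # Assumption: an empty JSON object stands in for "no custom runtimes"
    echo '{}' | sudo tee /etc/docker/daemon.json > /dev/null || return 1
    # Restart the daemon so that it re-reads its configuration
    sudo systemctl restart docker || return 1
    # Show which runtimes Docker knows about now
    docker info 2>/dev/null | grep -i 'runtimes'
}
```

Calling reset_nvidia_runtime then prints the Runtimes line, which makes it easy to see whether nvidia is still listed.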

LeroyOP0 commented 6 months ago

After uninstalling with sudo apt remove nvidia-container-runtime and rechecking with apt list --installed | grep nvidia-container we get:

libnvidia-container-tools/bionic,now 1.13.5-1 amd64 [installed,auto-removable]
libnvidia-container1/bionic,now 1.13.5-1 amd64 [installed,auto-removable]
nvidia-container-toolkit-base/bionic,now 1.13.5-1 amd64 [installed,auto-removable]
nvidia-container-toolkit/bionic,now 1.13.5-1 amd64 [installed,auto-removable]

No nvidia-container-runtime.

I removed the contents of daemon.json as well.

After restarting with sudo systemctl daemon-reload and sudo systemctl restart docker, docker info | grep run no longer lists the nvidia runtime, which is not good, I guess.

 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 runc version: v1.1.12-0-g51d5e94
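
Since grep run also matches the "Default Runtime" and "runc version" lines, a stricter check may be easier to read. The following is a minimal sketch: has_nvidia_runtime is a hypothetical helper name, and it only inspects the Runtimes line of whatever docker info output is piped into it.

```shell
# Hypothetical helper: reads `docker info` output on stdin and succeeds
# only if "nvidia" appears as a whole word on the Runtimes line.
has_nvidia_runtime() {
    grep -E '^[[:space:]]*Runtimes:' | grep -qw 'nvidia'
}

# Usage:
#   docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
```

Because the helper reads from standard input, it can also be tried on saved docker info output without touching the daemon.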

Now I'll try uninstalling nvidia-container-toolkit as well.

LeroyOP0 commented 6 months ago

Reinstalling nvidia-container-toolkit, restarting docker, and checking docker info shows that there's no nvidia runtime. Only runc.

 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 runc version: v1.1.12-0-g51d5e94

How is nvidia-container-toolkit supposed to create the required nvidia runtime?

LeroyOP0 commented 6 months ago

Oops, I missed some steps... I think I skipped the "configure Docker" step from the installation guide.

Rechecking

LeroyOP0 commented 6 months ago

BINGO!

I had simply missed the step that configures Docker. Once that is done, the nvidia runtime is set up.
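
The thread does not quote the step itself. It is most likely the nvidia-ctk runtime configure call from NVIDIA's installation guide, which writes the nvidia runtime entry into /etc/docker/daemon.json. A minimal sketch, assuming nvidia-ctk (shipped with the toolkit packages) is on the PATH; the wrapper function name configure_docker_for_nvidia is hypothetical:

```shell
# Hypothetical wrapper; nothing is changed until the function is called.
configure_docker_for_nvidia() {
    # Fail early if the toolkit CLI is missing
    command -v nvidia-ctk > /dev/null || { echo "nvidia-ctk not found" >&2; return 1; }
    # Adds the "nvidia" runtime entry to /etc/docker/daemon.json
    sudo nvidia-ctk runtime configure --runtime=docker || return 1
    # Restart Docker so that the new runtime is picked up
    sudo systemctl restart docker
}
```

After calling configure_docker_for_nvidia, docker info should list nvidia under Runtimes again.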

Thanks @2b-t you're awesome.

2b-t commented 6 months ago

Excellent, I will add a corresponding comment inside my guide. You are welcome, @LeroyOP0!