nonoy-suguitan opened 6 months ago
You should be able to force containerd by setting the RUNTIME envvar: https://github.com/NVIDIA/nvidia-container-toolkit/blob/1ddc859700c0d698f7f155fdbf7ae6f77ea0c1f5/tools/container/nvidia-toolkit/run.go#L78
I'm not sure why docker is being detected by the operator. Which version are you installing?
I set the RUNTIME envvar via helm:
helm install --wait gpu-operator -n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set 'toolkit.env[0].name=RUNTIME' \
--set 'toolkit.env[0].value=containerd' \
--set 'toolkit.env[1].name=CONTAINERD_CONFIG' \
--set 'toolkit.env[1].value=/etc/containerd/config.toml' \
--set 'toolkit.env[2].name=CONTAINERD_SOCKET' \
--set 'toolkit.env[2].value=/run/containerd/containerd.sock' \
--set 'toolkit.env[3].name=CONTAINERD_RUNTIME_CLASS' \
--set 'toolkit.env[3].value=nvidia' \
--set 'toolkit.env[4].name=CONTAINERD_SET_AS_DEFAULT' \
--set-string 'toolkit.env[4].value=true'
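For reference, the same toolkit env settings can be expressed as a values file instead of repeated --set flags; this is a sketch equivalent to the command above (the file path is arbitrary):

```shell
#!/bin/sh
# Write the toolkit env settings from the helm command above as a values file.
cat > /tmp/toolkit-values.yaml <<'EOF'
toolkit:
  env:
    - name: RUNTIME
      value: containerd
    - name: CONTAINERD_CONFIG
      value: /etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
EOF
# Then install with:
#   helm install --wait gpu-operator -n gpu-operator --create-namespace \
#     nvidia/gpu-operator -f /tmp/toolkit-values.yaml
grep -c 'name:' /tmp/toolkit-values.yaml   # sanity check: 5 env entries
```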
but the toolkit continues to use the docker RUNTIME
nvidia-container-toolkit-ds-runtime-containerd-orig.yaml.txt
So I manually edited the daemonset to update the RUNTIME, along with the corresponding containerd configuration (volumes, mounts, paths):
nvidia-container-toolkit-ds-runtime-containerd-mod.yaml.txt
This caused Docker on the worker node to become unavailable (so the worker node went into a NotReady state).
$ journalctl -u docker
Mar 22 16:27:38 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:38.711843457Z" level=info msg="Got signal to reload configuration, reloading from: /etc/docker/daemon.json"
Mar 22 16:27:38 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:38.711981033Z" level=error msg="unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: default-runtime: (from f
Mar 22 16:27:38 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:38.737324033Z" level=info msg="ignoring event" container=42c6c9c999618843c25e61a536df22755a0b8fe8dd6fccd4fc41fd9aa206c72f module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 22 16:27:38 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:38.904699844Z" level=info msg="ignoring event" container=387d2d2d3de14bdb44c076c6ec637fd017000da66128cec1fc51d7ed8937d81c module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 22 16:27:40 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:40.996234138Z" level=info msg="ignoring event" container=9603f5e4961c4ddb948b8e8078d925b45e7cce3528565f14a2678252955d304a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120253144Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=plugins.moby
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120315826Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=plugins.moby
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120337445Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial uni
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120364292Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial uni
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120258368Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=moby
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120438486Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal systemd[1]: Stopping Docker Application Container Engine...
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.195548180Z" level=info msg="Processing signal 'terminated'"
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.199483868Z" level=info msg="Daemon shutdown complete"
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal systemd[1]: Stopped Docker Application Container Engine.
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal systemd[1]: Starting Docker Application Container Engine...
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal dockerd[44320]: unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: default-runtime: (from flag: nvidia, from file: runc)
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal systemd[1]: Failed to start Docker Application Container Engine.
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal systemd[1]: Unit docker.service entered failed state.
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal systemd[1]: docker.service failed.
Mar 22 16:29:55 ip-192-168-9-254.ec2.internal systemd[1]: docker.service holdoff time over, scheduling restart.
Mar 22 16:29:55 ip-192-168-9-254.ec2.internal systemd[1]: Stopped Docker Application Container Engine.
Mar 22 16:29:55 ip-192-168-9-254.ec2.internal systemd[1]: Starting Docker Application Container Engine...
Mar 22 16:29:55 ip-192-168-9-254.ec2.internal dockerd[44350]: unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: default-runtime: (from flag: nvidia, from file: runc)
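The fatal error above is dockerd refusing to start because default-runtime is set both as a flag (from the systemd unit) and in /etc/docker/daemon.json. A minimal local sketch of that conflict check (the paths and file contents here are illustrative, not taken from the node):

```shell
#!/bin/sh
# Sketch: reproduce the duplicate default-runtime check that makes dockerd
# refuse to start. Uses a temp directory instead of the real /etc/docker.
tmp=$(mktemp -d)
cat > "$tmp/daemon.json" <<'EOF'
{ "default-runtime": "runc" }
EOF
# Flags passed by the systemd unit, as reported in the error message
flags="--default-runtime nvidia"
if grep -q '"default-runtime"' "$tmp/daemon.json" && \
   printf '%s' "$flags" | grep -q -- '--default-runtime'; then
  echo "conflict: default-runtime set in both daemon.json and unit flags"
fi
```

The fix is to remove the directive from one of the two places (typically drop it from daemon.json, or from the unit's ExecStart flags) so dockerd sees only one source.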
I was able to verify that /etc/containerd/config.toml and /run/containerd/containerd.sock exist on the worker node, and that containerd is running (journalctl -u containerd).
Docker version on the worker node:
$ docker version
Client:
 Version:           20.10.25
 API version:       1.41
 Go version:        go1.20.12
 Git commit:        b82b9f3
 Built:             Fri Dec 29 20:37:18 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.25
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.20.12
  Git commit:       5df983c
  Built:            Fri Dec 29 20:38:05 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.11
  GitCommit:        64b8a811b07ba6288238eefc14d898ee0b5b99ba
 nvidia:
  Version:          1.1.11
  GitCommit:        4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
gpu-operator version:
% helm list -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gpu-operator gpu-operator 1 2024-03-22 09:12:12.634727 -0700 PDT deployed gpu-operator-v23.9.2 v23.9.2
For what it's worth, I've got this working on a k8s v1.25 cluster (where containerd runtime is used and not docker).
I'm just wondering if there's a way to bypass the Docker dependency (that is, to use containerd directly) while Kubernetes itself is using Docker (as in versions such as v1.23).
@nonoy-suguitan that configuration doesn't make sense. If you have set up the kubelet to use dockershim and docker is the underlying runtime, then gpu-operator will use that, since all GPU containers will be launched using docker. containerd will not be used in that case.
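One way to see which runtime the operator will detect is the CONTAINER-RUNTIME column of `kubectl get nodes -o wide`, which reports what the kubelet is actually using. Below is a sketch that parses a sample of that output (the node line is made up to match this cluster, not captured from it):

```shell
#!/bin/sh
# Sketch: extract the container runtime from `kubectl get nodes -o wide`
# output. The sample text stands in for the real command's output.
sample='NAME                            STATUS   VERSION   CONTAINER-RUNTIME
ip-192-168-9-254.ec2.internal   Ready    v1.23.x   docker://20.10.25'
printf '%s\n' "$sample" | awk 'NR>1 {print $NF}'
```

A node reporting docker:// here is why the toolkit daemonset ends up with RUNTIME=docker regardless of the helm override.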
I have a simple Kubernetes v1.23 cluster and attempted to install the gpu-operator on it, specifying the containerd args (the helm command shown above).
The install is successful, but inspecting the toolkit's daemonset shows that it's setting the RUNTIME environment variable to docker instead of containerd. Is there a way I can install the toolkit to use the containerd runtime instead?