NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

toolkit uses docker runtime when containerd args are specified #683

Open nonoy-suguitan opened 6 months ago

nonoy-suguitan commented 6 months ago

I have a simple k8s cluster running v1.23, and I'm attempting to install the gpu-operator on it, specifying the containerd args:

eksctl create cluster \
--name test-gpu-cluster-eksctl \
--version 1.23 \
--region us-east-1 \
--nodegroup-name gpu-nodes \
--node-type g4dn.xlarge \
--nodes 1

helm install --wait gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
    --set 'toolkit.env[0].value=/etc/containerd/config.toml' \
    --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
    --set 'toolkit.env[1].value=/run/containerd/containerd.sock' \
    --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
    --set 'toolkit.env[2].value=nvidia' \
    --set 'toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT' \
    --set-string 'toolkit.env[3].value=true'

The install is successful, but inspecting the toolkit's daemonset shows that the RUNTIME environment variable is set to docker instead of containerd:

        env:
        - name: ROOT
          value: /usr/local/nvidia
        - name: RUNTIME_ARGS
        - name: NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND
          value: management.nvidia.com/gpu
        - name: NVIDIA_VISIBLE_DEVICES
          value: void
        - name: CONTAINERD_CONFIG
          value: /etc/containerd/config.toml
        - name: CONTAINERD_SOCKET
          value: /run/containerd/containerd.sock
        - name: CONTAINERD_RUNTIME_CLASS
          value: nvidia
        - name: CONTAINERD_SET_AS_DEFAULT
          value: "true"
        - name: RUNTIME
          value: docker
        - name: DOCKER_CONFIG
          value: /runtime/config-dir/daemon.json
        - name: DOCKER_SOCKET
          value: /runtime/sock-dir/docker.sock
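
(For reference, the env block above came from dumping the toolkit daemonset with something like the command below; the exact daemonset name is assumed here.)

# the container env shown above is under .spec.template.spec.containers in the output
$ kubectl -n gpu-operator get daemonset nvidia-container-toolkit-daemonset -o yaml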

Is there a way I can install the toolkit to use the containerd runtime instead?

elezar commented 6 months ago

You should be able to force containerd by setting the RUNTIME envvar: https://github.com/NVIDIA/nvidia-container-toolkit/blob/1ddc859700c0d698f7f155fdbf7ae6f77ea0c1f5/tools/container/nvidia-toolkit/run.go#L78

I'm not sure why docker is being detected by the operator. Which version are you installing?

nonoy-suguitan commented 6 months ago

I set the RUNTIME envvar via helm:

helm install --wait gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set 'toolkit.env[0].name=RUNTIME' \
    --set 'toolkit.env[0].value=containerd' \
    --set 'toolkit.env[1].name=CONTAINERD_CONFIG' \
    --set 'toolkit.env[1].value=/etc/containerd/config.toml' \
    --set 'toolkit.env[2].name=CONTAINERD_SOCKET' \
    --set 'toolkit.env[2].value=/run/containerd/containerd.sock' \
    --set 'toolkit.env[3].name=CONTAINERD_RUNTIME_CLASS' \
    --set 'toolkit.env[3].value=nvidia' \
    --set 'toolkit.env[4].name=CONTAINERD_SET_AS_DEFAULT' \
    --set-string 'toolkit.env[4].value=true'

but the toolkit continues to use the docker RUNTIME (see attached nvidia-container-toolkit-ds-runtime-containerd-orig.yaml.txt).

So I manually edited the daemonset to update the RUNTIME, along with the corresponding containerd configuration (volumes, mounts, paths); see attached nvidia-container-toolkit-ds-runtime-containerd-mod.yaml.txt.

This caused docker on the worker node to become unavailable, and the worker node went into a NotReady state.

$ journalctl -u docker
Mar 22 16:27:38 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:38.711843457Z" level=info msg="Got signal to reload configuration, reloading from: /etc/docker/daemon.json"
Mar 22 16:27:38 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:38.711981033Z" level=error msg="unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: default-runtime: (from f
Mar 22 16:27:38 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:38.737324033Z" level=info msg="ignoring event" container=42c6c9c999618843c25e61a536df22755a0b8fe8dd6fccd4fc41fd9aa206c72f module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 22 16:27:38 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:38.904699844Z" level=info msg="ignoring event" container=387d2d2d3de14bdb44c076c6ec637fd017000da66128cec1fc51d7ed8937d81c module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 22 16:27:40 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:40.996234138Z" level=info msg="ignoring event" container=9603f5e4961c4ddb948b8e8078d925b45e7cce3528565f14a2678252955d304a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120253144Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=plugins.moby
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120315826Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=plugins.moby
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120337445Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial uni
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120364292Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial uni
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120258368Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=moby
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.120438486Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal systemd[1]: Stopping Docker Application Container Engine...
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.195548180Z" level=info msg="Processing signal 'terminated'"
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal dockerd[3246]: time="2024-03-22T16:27:47.199483868Z" level=info msg="Daemon shutdown complete"
Mar 22 16:27:47 ip-192-168-9-254.ec2.internal systemd[1]: Stopped Docker Application Container Engine.
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal systemd[1]: Starting Docker Application Container Engine...
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal dockerd[44320]: unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: default-runtime: (from flag: nvidia, from file: runc)
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal systemd[1]: Failed to start Docker Application Container Engine.
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal systemd[1]: Unit docker.service entered failed state.
Mar 22 16:29:53 ip-192-168-9-254.ec2.internal systemd[1]: docker.service failed.
Mar 22 16:29:55 ip-192-168-9-254.ec2.internal systemd[1]: docker.service holdoff time over, scheduling restart.
Mar 22 16:29:55 ip-192-168-9-254.ec2.internal systemd[1]: Stopped Docker Application Container Engine.
Mar 22 16:29:55 ip-192-168-9-254.ec2.internal systemd[1]: Starting Docker Application Container Engine...
Mar 22 16:29:55 ip-192-168-9-254.ec2.internal dockerd[44350]: unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: default-runtime: (from flag: nvidia, from file: runc)
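
The error ("the following directives are specified both as a flag and in the configuration file: default-runtime") means dockerd is being started with a --default-runtime flag while /etc/docker/daemon.json also sets default-runtime. This can be double-checked on the node with standard commands (nothing gpu-operator specific):

# show the docker unit and its drop-ins to find where the conflicting flag is added
# (on some distros the flags live in an EnvironmentFile referenced there)
$ systemctl cat docker

# and the file side of the conflict, as named in the error message
$ cat /etc/docker/daemon.json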

I was able to verify the following:

Docker version on the worker node:

$ docker version
Client:
 Version:           20.10.25
 API version:       1.41
 Go version:        go1.20.12
 Git commit:        b82b9f3
 Built:             Fri Dec 29 20:37:18 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.25
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.20.12
  Git commit:       5df983c
  Built:            Fri Dec 29 20:38:05 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.11
  GitCommit:        64b8a811b07ba6288238eefc14d898ee0b5b99ba
 nvidia:
  Version:          1.1.11
  GitCommit:        4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

gpu-operator version:

% helm list -A
NAME            NAMESPACE       REVISION    UPDATED                                 STATUS      CHART                   APP VERSION
gpu-operator    gpu-operator    1           2024-03-22 09:12:12.634727 -0700 PDT    deployed    gpu-operator-v23.9.2    v23.9.2

nonoy-suguitan commented 6 months ago

For what it's worth, I've got this working on a k8s v1.25 cluster (where the containerd runtime is used rather than docker).

I'm just wondering if there's a way to bypass the docker dependency (that is, use containerd directly) while k8s is using docker (in k8s versions such as v1.23).

shivamerla commented 6 months ago

@nonoy-suguitan that configuration doesn't make sense. If you have set up the kubelet to use dockershim and docker is the underlying runtime, then the gpu-operator will use docker, as all GPU containers will be launched using docker. containerd will not be used in that case.
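
A quick way to confirm which runtime the kubelet is actually wired to (and therefore which one the operator will detect and configure) is to check what each node reports, for example:

# the CONTAINER-RUNTIME column shows e.g. docker://20.10.25 on a dockershim node,
# or containerd://1.x where the kubelet talks to containerd directly
$ kubectl get nodes -o wide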