NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

container-toolkit fails to start after upgrading to v24.9.0 on k3s cluster #1109

Open logan2211 opened 2 weeks ago

logan2211 commented 2 weeks ago

This is effectively a continuation of #1099, but I cannot re-open that issue, so opening a new one.

I am experiencing the same problem while attempting to upgrade from v24.6.0 to v24.9.0 on a k3s cluster. Perhaps this is a bad interaction between this recent commit and the non-standard CONTAINERD_* paths required for gpu-operator on k3s, specified in my cluster's values as:

    toolkit:
      env:
      - name: CONTAINERD_CONFIG
        value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
      - name: CONTAINERD_SOCKET
        value: /run/k3s/containerd/containerd.sock
      - name: CONTAINERD_SET_AS_DEFAULT
        value: "false"

The pod log:

nvidia-container-toolkit-ctr IS_HOST_DRIVER=false
nvidia-container-toolkit-ctr NVIDIA_DRIVER_ROOT=/run/nvidia/driver
nvidia-container-toolkit-ctr DRIVER_ROOT_CTR_PATH=/driver-root
nvidia-container-toolkit-ctr NVIDIA_DEV_ROOT=/run/nvidia/driver
nvidia-container-toolkit-ctr DEV_ROOT_CTR_PATH=/driver-root
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Parsing arguments"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Starting nvidia-toolkit"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="disabling device node creation since --cdi-enabled=false"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Verifying Flags"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg=Initializing
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Shutting Down"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=error msg="error running nvidia-toolkit: unable to determine runtime options: unable to load containerd config: failed to load config: failed to run command chroot [/host containerd config dump]: exit status 127"

I confirmed that gpu-operator is setting the correct CONTAINERD_* paths according to my values:

  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rancher/k3s/agent/etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/k3s/containerd
    HostPathType:
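
The mounts above come from the toolkit pod spec; something like the following should reproduce that check (assuming the toolkit daemonset runs in the gpu-operator namespace with the label app=nvidia-container-toolkit-daemonset):

    kubectl describe pod -n gpu-operator -l app=nvidia-container-toolkit-daemonset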

logan2211 commented 2 weeks ago

It makes sense that the command it is trying to run fails: there is no containerd binary on the host, so chroot /host containerd config dump is expected to fail on a k3s cluster.
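
Exit status 127 is the shell's "command not found" code, which matches how a stock k3s node is laid out: k3s bundles containerd under its own data directory rather than installing it on the default PATH (the paths below are illustrative and vary by k3s version):

    # Run on the k3s node itself
    command -v containerd || echo "containerd not in PATH"
    # containerd not in PATH
    ls /var/lib/rancher/k3s/data/current/bin/containerd
    # /var/lib/rancher/k3s/data/current/bin/containerd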

logan2211 commented 2 weeks ago

Looks like the fix is in progress here https://github.com/NVIDIA/nvidia-container-toolkit/pull/777

cdesiniotis commented 2 weeks ago

Hi @logan2211, thanks for reporting this issue. It is on our radar and we are working on getting a fix out. We recently switched to fetching the currently applied container runtime configuration via the CLI (e.g. containerd config dump) rather than from a file (see https://github.com/NVIDIA/nvidia-container-toolkit/commit/f477dc0df1007c07f997cd575b0c690897458ac1). This appears to have broken systems where the CLI binary is not in the PATH, like k3s. We are working on using the TOML file as a fallback option in case the CLI binary cannot be found: https://github.com/NVIDIA/nvidia-container-toolkit/pull/777
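
In shell terms, the fallback being worked on amounts to something like the following (a rough sketch of the intended behavior, not the actual toolkit code; the config path is the k3s one from the values above):

    # Prefer the live configuration from the containerd CLI; fall back to
    # reading the TOML file when the binary cannot be found in the host chroot.
    if chroot /host sh -c 'command -v containerd' >/dev/null 2>&1; then
        chroot /host containerd config dump
    else
        chroot /host cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    fi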

logan2211 commented 2 weeks ago

Thanks. We are trying to upgrade the cluster urgently due to the CVEs. I suppose one possible workaround may be to downgrade gpu-operator to v24.6.2 and override the driver version to 550.127.05?

edit: after testing, the proposed workaround of downgrading gpu-operator and pinning the driver version in values works fine; just noting for anyone else experiencing this issue.
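
For anyone following the same path, the downgrade can look roughly like this (a sketch, assuming the standard nvidia/gpu-operator chart, a release named gpu-operator, and an existing values.yaml containing the toolkit env settings above):

    helm upgrade gpu-operator nvidia/gpu-operator \
      -n gpu-operator \
      --version v24.6.2 \
      --set driver.version=550.127.05 \
      -f values.yaml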

cdesiniotis commented 2 weeks ago

> downgrade gpu-operator to v24.6.2 and override the driver version to 550.127.05?

This should work. Alternatively, you can stay on GPU Operator v24.9.0 and downgrade the NVIDIA Container Toolkit to 1.16.2, which does not contain this change.
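
If you take the toolkit-downgrade route instead, the version can be pinned the same way (the image tag suffix here is an assumption; check the available container-toolkit tags for your base image flavor):

    helm upgrade gpu-operator nvidia/gpu-operator \
      -n gpu-operator \
      --version v24.9.0 \
      --set toolkit.version=v1.16.2-ubuntu20.04 \
      -f values.yaml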

cdesiniotis commented 1 week ago

NVIDIA Container Toolkit 1.17.1 is now available and contains a fix for this issue: https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.17.1

I would recommend overriding the NVIDIA Container Toolkit version to 1.17.1 by configuring toolkit.version in ClusterPolicy.
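
On a running cluster, one way to do that is to patch the ClusterPolicy directly (a sketch; the resource is typically named cluster-policy, and the exact image tag suffix, e.g. -ubuntu20.04, depends on the toolkit image flavor in use):

    kubectl patch clusterpolicies.nvidia.com cluster-policy \
      --type merge \
      -p '{"spec": {"toolkit": {"version": "v1.17.1-ubuntu20.04"}}}'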

elpsyr commented 1 day ago

This approach can also solve the problem :) Create a wrapper script on the host so that a containerd binary is resolvable on the PATH (RKE2 paths shown; adjust for your distribution):

vi /usr/local/bin/containerd

    #!/bin/bash
    # Forward to the distribution-bundled containerd and its config
    /var/lib/rancher/rke2/bin/containerd --config /var/lib/rancher/rke2/agent/etc/containerd/config.toml "$@"

sudo chmod +x /usr/local/bin/containerd