logan2211 opened this issue 2 weeks ago
It makes sense that the command it is trying to run fails: there is no standalone containerd binary on the host's default PATH, so `chroot /host containerd config dump` is expected to fail on a k3s cluster.
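For anyone unfamiliar with the k3s layout, a quick illustration; the data path below is the k3s default and may differ per install:

```sh
# On a k3s node there is no standalone containerd on the default PATH...
command -v containerd || echo "containerd not found on PATH"

# ...because k3s bundles its own containerd under its data directory, e.g.:
ls /var/lib/rancher/k3s/data/current/bin/containerd
```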
Looks like the fix is in progress here https://github.com/NVIDIA/nvidia-container-toolkit/pull/777
Hi @logan2211, thanks for reporting this issue. It is on our radar and we are working on getting a fix out for this. We recently switched to fetching the currently applied container runtime configuration via the CLI (e.g. `containerd config dump`) rather than from a file (see https://github.com/NVIDIA/nvidia-container-toolkit/commit/f477dc0df1007c07f997cd575b0c690897458ac1). This appears to have broken systems where the CLI binary is not in the PATH, like k3s. We are working on using the TOML file as a fallback option in case the CLI binary cannot be found: https://github.com/NVIDIA/nvidia-container-toolkit/pull/777
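Roughly, the two approaches look like this on a k3s host; the config path here is the k3s default and is only illustrative:

```sh
# New behaviour: ask the runtime for its effective configuration via the CLI
# (this is what the toolkit container runs against the host)
chroot /host containerd config dump

# Fallback being added in the PR above (and, roughly, the previous behaviour):
# read the config TOML directly, e.g. on k3s:
cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```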
Thanks, we are trying to upgrade the cluster urgently due to the CVEs. I suppose one possible workaround may be to downgrade gpu-operator to v24.6.2 and override the driver version to 550.127.05?
edit: after testing, the proposed workaround of downgrading gpu-operator and pinning the driver in values seems to work fine; just noting for anyone else experiencing this issue.
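For reference, a rough sketch of that workaround with Helm. The release name, repo alias, and namespace below are assumptions and should be adjusted to match your install; `--reuse-values` keeps any existing custom values (such as the k3s CONTAINERD overrides) in place.

```sh
# Downgrade the operator chart to v24.6.2 and pin the driver version explicitly
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --version v24.6.2 \
  --set driver.version=550.127.05 \
  --reuse-values
```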
> downgrade gpu-operator to v24.6.2 and override the driver version to 550.127.05?
This should work. Alternatively, you can stick with GPU Operator v24.9.0 and downgrade the NVIDIA Container Toolkit to 1.16.2, which does not contain this change.
NVIDIA Container Toolkit 1.17.1 is now available and contains a fix for this issue: https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.17.1
I would recommend overriding the NVIDIA Container Toolkit version to 1.17.1 by configuring `toolkit.version` in ClusterPolicy.
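For example, a minimal sketch of that override, assuming the ClusterPolicy created by the chart is named `cluster-policy` and an Ubuntu-based toolkit image (check the available container-toolkit image tags for the exact suffix):

```sh
# Point the toolkit at the 1.17.1 container-toolkit image
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge \
  -p '{"spec": {"toolkit": {"version": "v1.17.1-ubuntu20.04"}}}'
```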
This approach can also solve the problem :)

Create a wrapper script on the host at `/usr/local/bin/containerd` (e.g. `vi /usr/local/bin/containerd`):

```sh
#!/bin/bash
# Forward to the RKE2-bundled containerd, using RKE2's generated config
exec /var/lib/rancher/rke2/bin/containerd --config /var/lib/rancher/rke2/agent/etc/containerd/config.toml "$@"
```

Then make it executable:

```sh
sudo chmod +x /usr/local/bin/containerd
```
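(The paths above are RKE2's; on k3s the equivalents would be roughly the bundled binary under `/var/lib/rancher/k3s/data/current/bin/` and the config at `/var/lib/rancher/k3s/agent/etc/containerd/config.toml`, which is worth double-checking on your nodes. The wrapper simply makes `chroot /host containerd config dump` resolve to the bundled binary with the right config.)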
This is effectively a continuation of #1099, but I cannot re-open that issue, so opening a new one.
I am experiencing the same problem while attempting to upgrade from v24.6.0 to v24.9.0 on a k3s cluster. Perhaps there is a bad interaction between this recent commit and the non-standard CONTAINERD_* paths required for gpu-operator on k3s, specified in my cluster's values as:
The pod log:
I confirmed that gpu-operator is setting the correct CONTAINERD_* paths according to my values: