NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Following gpu-operator documentation will break RKE2 cluster after reboot #992

Open aiicore opened 2 months ago

aiicore commented 2 months ago

The RKE2 docs only mention passing RKE2's internal CONTAINERD_SOCKET: https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator

NVIDIA's docs additionally set CONTAINERD_CONFIG: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2

Following the gpu-operator documentation, the RKE2 cluster breaks after a reboot.

The most significant errors in the RKE2 logs are:

Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Pod for etcd not synced (pod sandbox has changed), retrying"
Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Waiting for API server to become available"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"

Following the RKE2 docs and passing only CONTAINERD_SOCKET works, because the gpu-operator then writes its config (which does not work with RKE2) into /etc/containerd/config.toml, even though containerd is not installed at the OS level:

root@rke2:~# apt list --installed | grep containerd

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

root@rke2:~#
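
In other words, the toolkit-written file still gets created on the node, it just isn't consumed by any OS-level containerd. A quick way to confirm this (a sketch only; it assumes a default setup, with /etc/containerd/config.toml being the toolkit's default target as described above):

# Sketch: the toolkit-written config exists, but no OS-level containerd uses it.
ls -l /etc/containerd/config.toml
systemctl status containerd 2>/dev/null || echo "no OS-level containerd service (RKE2 ships its own)"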

It looks like the containerd config provided by the gpu-operator doesn't matter with RKE2, since RKE2 is able to detect nvidia-container-runtime and configure its own containerd config with an nvidia runtime class:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  SystemdCgroup = true
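
A corresponding check on the node (a sketch; the path assumes a default RKE2 install and is the rendered counterpart of the config.toml.tmpl path used in the helm command below) is to grep RKE2's own containerd config for the nvidia runtime entry:

# Sketch: inspect the containerd config that RKE2 actually renders and uses.
grep -A 3 'runtimes."nvidia"' /var/lib/rancher/rke2/agent/etc/containerd/config.toml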

Steps to reproduce on Ubuntu 22.04:

Following NVIDIA's docs breaks the RKE2 cluster after a reboot:

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true
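
If a node is already broken this way, one recovery path consistent with the failure mode above (an assumption on my part, not something verified in this thread) is to remove the template the toolkit wrote so RKE2 regenerates its own containerd config, then restart the RKE2 service:

# Recovery sketch (assumption, not verified here): let RKE2 regenerate its containerd config.
rm /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
systemctl restart rke2-server   # or rke2-agent on worker nodes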

Following RKE2's docs works fine:

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set toolkit.env[0].name=CONTAINERD_SOCKET \
    --set toolkit.env[0].value=/run/k3s/containerd/containerd.sock
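
A minimal post-install check (a sketch; it assumes the operator registers a runtime class as discussed later in this thread):

# Sketch: confirm the operator pods come up and a runtime class was registered.
kubectl get pods -n gpu-operator
kubectl get runtimeclass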

Could someone verify the docs?

mikemckiernan commented 2 months ago

Anyone on the NVIDIA team object to replacing our sample command with a reference to the RKE2 docs? That's my preference.

https://docs.rke2.io/advanced#deploy-nvidia-operator

DevFontes commented 2 months ago

I'm using Ubuntu 22.04 with an NVIDIA RTX A2000 12GB and K8s 1.27.11+RKE2r1.

Is there any problem with using driver version 560 instead of 535, as indicated in the RKE2 doc?

mikemckiernan commented 2 months ago

I'm fairly confident that using the 560 driver, or any driver covered in the product docs, is OK.

However, I'd like SME input from my teammates. When I followed the RKE2 doc, I found that I needed to specify runtimeClassName, like the sample nbody workload does. I can't choose what other people prefer or dislike, but I happen to dislike that approach.
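
For illustration, this is the kind of pod spec being discussed, where the workload opts into the nvidia runtime explicitly (a sketch; the pod name and image tag are placeholders, not taken from the RKE2 doc):

# Sketch: a GPU smoke-test pod that selects the nvidia runtime explicitly.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF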

aiicore commented 2 months ago

@mikemckiernan I think it's due to the gpu-operator setting the nvidia runtime class as the default in containerd. RKE2 just adds another runtime, which in my opinion is the cleaner approach. I don't know why the gpu-operator has this option; maybe it's there to be consistent with Docker? I remember that a long time ago I needed to install the nvidia runtime for Docker and change the default Docker runtime to nvidia to make it work.

If the gpu-operator worked properly with RKE2, i.e. created a valid config.toml.tmpl, the nvidia runtime class would be the default when CONTAINERD_SET_AS_DEFAULT=true.
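
In containerd terms, "set as default" maps to the CRI plugin's default_runtime_name setting, so a quick way to see which runtime RKE2's containerd currently treats as the default (a sketch; the path assumes a default RKE2 install):

# Sketch: check which runtime is the containerd default on an RKE2 node.
grep default_runtime_name /var/lib/rancher/rke2/agent/etc/containerd/config.toml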

justinthelaw commented 5 days ago

I am not sure what version of the GPU operator you are using, but would the following values file work for you, @aiicore?

https://github.com/defenseunicorns/uds-rke2/blob/main/packages/nvidia-gpu-operator/values/nvidia-gpu-operator-values.yaml