NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0
1.81k stars 295 forks

550.90.07-5.15.0-1061-gke-ubuntu22.04 image tag not found when installing with `driver.usePrecompiled` on GKE #933

Open chipzoller opened 2 months ago

chipzoller commented 2 months ago

1. Quick Debug Information

2. Issue or feature description

When the operator is installed with driver.usePrecompiled: true on GKE, the nvidia-driver-daemonset-5.15.0-1061-gke-ubuntu22.04 DaemonSet fails to start because the image tag 550.90.07-5.15.0-1061-gke-ubuntu22.04 cannot be found in the nvcr.io/nvidia/driver repository.

3. Steps to reproduce the issue

  1. Install the v24.6.1 operator on GKE with the following in your values file.
    driver:
      enabled: true
      usePrecompiled: true
  2. Observe that the operator spawns a DaemonSet named nvidia-driver-daemonset-5.15.0-1061-gke-ubuntu22.04.
  3. Observe that the image defined in the DaemonSet cannot be pulled.

chipzoller commented 2 months ago

Yes, I understand per the GKE docs here that drivers must be separately installed and from here that driver installation must be disabled via the operator, but it seems like there should be a validation check in the operator preventing installation if an image tag doesn't exist.

cdesiniotis commented 2 months ago

@chipzoller there are no precompiled driver packages for the GKE kernels, which is why we do not have any precompiled container images for the Ubuntu 22.04 + GKE kernel variant. If you want the GPU Operator to deploy and manage the lifecycle of the driver, you will need to use the non-precompiled images.

chipzoller commented 2 months ago

Hi @cdesiniotis, yes, I get that. I'm just pointing out with this issue that there isn't any mechanism to prevent users from hitting this situation. My recommendation is some template logic that blocks this condition, so the chart fails to deploy rather than deploying happily only for a component to fail to come up because of an unavailable tag.

cdesiniotis commented 2 months ago

@chipzoller the kernel version, and thus the precompiled driver image tag, is not known until runtime. The gpu-operator constructs the image tag from the OS name + kernel version running on the GPU node -- it gets the needed information from node labels added by Node Feature Discovery. I don't believe this is something we can easily validate at the point in time when the chart is installed.
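To illustrate the runtime construction described above, here is a minimal Python sketch (the operator itself is not written in Python, and this is not its actual code). It assumes the tag has the form `<driver version>-<kernel version>-<OS ID><OS version>`, as the tag in this issue suggests, and that the inputs come from the Node Feature Discovery labels named in the constants. The function name `precompiled_image_tag` is hypothetical.

```python
# Node labels published by Node Feature Discovery (NFD). Which labels the
# operator actually reads is an assumption made for this illustration.
KERNEL_LABEL = "feature.node.kubernetes.io/kernel-version.full"
OS_ID_LABEL = "feature.node.kubernetes.io/system-os_release.ID"
OS_VERSION_LABEL = "feature.node.kubernetes.io/system-os_release.VERSION_ID"


def precompiled_image_tag(driver_version: str, labels: dict) -> str:
    """Build a precompiled driver image tag from a node's labels.

    Hypothetical helper: the tag layout is inferred from the tag quoted
    in this issue, not taken from the operator source.
    """
    required = (KERNEL_LABEL, OS_ID_LABEL, OS_VERSION_LABEL)
    missing = [key for key in required if key not in labels]
    if missing:
        # Without NFD labels the tag cannot be computed at all.
        raise KeyError(f"node is missing NFD labels: {missing}")
    kernel = labels[KERNEL_LABEL]
    os_name = labels[OS_ID_LABEL] + labels[OS_VERSION_LABEL]
    return f"{driver_version}-{kernel}-{os_name}"


# Label values matching the GKE node from this issue.
gke_node_labels = {
    KERNEL_LABEL: "5.15.0-1061-gke",
    OS_ID_LABEL: "ubuntu",
    OS_VERSION_LABEL: "22.04",
}
print(precompiled_image_tag("550.90.07", gke_node_labels))
```

Because the kernel and OS values only exist once a node is running and labelled, none of this is available while the chart is being templated.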

chipzoller commented 2 months ago

You should be able to use the Helm lookup function to retrieve a node's labels and then fail conditionally. This could have some negative implications, however, as some tools don't support it, including some cloud vendor marketplace catalogs, if I recall. An alternative could be to fail in the operator container and print the relevant message rather than template a resource with an invalid image tag. If none of those seem like viable options, feel free to close this as not planned. Just throwing some ideas out there that may help others.

cdesiniotis commented 2 months ago

> An alternative could be to fail in the operator container and print the relevant message rather than template a resource with an invalid image tag.

This seems like the most reasonable option if we wanted to fail earlier.
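As a rough illustration of the "fail in the operator" idea, the Python sketch below checks that a tag exists before anything is templated, and raises a descriptive error otherwise. It is not the operator's code. `ensure_precompiled_image`, `PrecompiledImageNotFound`, and the `tag_exists` callback are all hypothetical names; a real check would have to query the container registry, which is stubbed out here.

```python
from typing import Callable


class PrecompiledImageNotFound(RuntimeError):
    """Raised when no precompiled driver image exists for a node."""


def ensure_precompiled_image(
    repository: str,
    tag: str,
    tag_exists: Callable[[str, str], bool],
) -> str:
    """Return the full image reference, or fail with a clear message.

    `tag_exists` is a hypothetical callback standing in for a registry
    lookup; it receives the repository and the tag.
    """
    if not tag_exists(repository, tag):
        raise PrecompiledImageNotFound(
            f"no precompiled driver image {repository}:{tag}; "
            "set driver.usePrecompiled to false or disable the driver"
        )
    return f"{repository}:{tag}"


# Stubbed registry with no GKE kernel variants, as described in the thread.
def fake_tag_exists(repository: str, tag: str) -> bool:
    return False


try:
    ensure_precompiled_image(
        "nvcr.io/nvidia/driver",
        "550.90.07-5.15.0-1061-gke-ubuntu22.04",
        fake_tag_exists,
    )
except PrecompiledImageNotFound as err:
    print(f"refusing to create driver DaemonSet: {err}")
```

The point of the design is that the failure surfaces as an operator log line or status message naming the missing tag, instead of a pod stuck on an image pull error.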