Open chipzoller opened 2 months ago
@chipzoller there are no precompiled driver packages for the `gke` kernels, which is why we do not have any precompiled container images for the Ubuntu 22.04 + `gke` kernel variant. If you want the GPU Operator to deploy and manage the lifecycle of the driver, you will need to use the non-precompiled images.
Hi @cdesiniotis, yes I get that, just stating with this issue that there isn't any mechanism to prevent users from hitting this situation. My recommendation is some template logic which blocks this condition so the chart fails to deploy rather than happily being deployed only for some component to fail to come up due to an unavailable tag.
@chipzoller the kernel version, and thus the precompiled driver image tag, is not known until runtime. The gpu-operator constructs the image tag from the OS name + kernel version running on the GPU node -- it gets the needed information from node labels added by Node Feature Discovery. I don't believe this is something we can easily validate at the point in time when the chart is installed.
You should be able to use the Helm `lookup()` function to retrieve a node's labels and then fail conditionally. This would have some potential negative implications, however, as some tools don't support this, including some cloud vendor marketplace catalogs, if I recall. An alternative could be to fail in the operator container and print the relevant message rather than template a resource with an invalid image tag. If none of those seem like viable options, feel free to close this as not planned. Just throwing some ideas out there that may help others.
> An alternative could be to fail in the operator container and print the relevant message rather than template a resource with an invalid image tag.

This seems like the most reasonable option if we wanted to fail earlier.
1. Quick Debug Information
2. Issue or feature description
When the operator is installed with `driver.usePrecompiled: true` on GKE, the `nvidia-driver-daemonset-5.15.0-1061-gke-ubuntu22.04` DaemonSet fails to start because the image tag `550.90.07-5.15.0-1061-gke-ubuntu22.04` cannot be found in the `nvcr.io/nvidia/driver` repository.
3. Steps to reproduce the issue
nvidia-driver-daemonset-5.15.0-1061-gke-ubuntu22.04