Status: Open. christidis opened this issue 1 year ago.
It worked for me after adding this block to the node pool Terraform code:
guest_accelerator {
  type  = "nvidia-l4"
  count = 1
}
and installing the latest version of the GPU driver:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
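For context, this is a minimal sketch of where that guest_accelerator block might sit inside a google_container_node_pool resource, assuming the Terraform Google provider. The resource label `l4_pool`, the pool name, the `google_container_cluster.primary` reference, the zone, and the node count are hypothetical placeholders, not the commenter's actual configuration.

```hcl
# Hypothetical node pool; names, cluster reference and zone are placeholders.
resource "google_container_node_pool" "l4_pool" {
  name       = "l4-pool"                            # hypothetical pool name
  cluster    = google_container_cluster.primary.id  # hypothetical cluster resource
  location   = "us-central1-a"                      # hypothetical zone with G2/L4 capacity
  node_count = 2

  node_config {
    machine_type = "g2-standard-12"  # G2 machine type used in the issue
    image_type   = "COS_CONTAINERD"  # COS-based image, as in the issue

    # Attach the L4 GPU explicitly, as described in the comment above.
    guest_accelerator {
      type  = "nvidia-l4"
      count = 1
    }
  }
}
```

The driver DaemonSet still has to be applied separately with the kubectl command above; this fragment only declares the accelerator on the nodes.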
Description
I am trying to use G2 machines with L4 GPUs in GKE.
GKE control plane version: v1.24.12-gke.500
Node pool version: v1.24.9-gke.3200
According to the documentation, these are the requirements for installing L4 GPUs in Kubernetes:
Requirements
L4 GPUs:
Daemonset drivers
For this, I configured a COS-based g2-standard-12 node pool (which includes one L4 GPU by default) and deployed it in my cluster.
I made sure to install the drivers mentioned in the documentation.
Then I noticed that the pods were in a CrashLoopBackOff state.
Logs
Latest Daemonset drivers
I then installed the latest DaemonSet, only to see more or less the same errors in the driver installer
(the logs are duplicated because there are two nodes and therefore two nvidia-driver-installer pods).
Daemonset 525 drivers (installed manually)
I also tried downloading the latest DaemonSet locally and editing it to install a specific version of the 525 driver, but the driver installation failed again. I tried every 525 version, and they all failed with the same error.
Conclusion
I haven't had such issues with other GPU types in the past. I have now switched to P100 GPUs on Ubuntu, but I am really interested in using G2 with L4 because it is a better fit for our use case.
Is there any way to run G2 with an L4 GPU and a working driver in GKE, using either the Ubuntu or the COS image type?