Open brannondorsey opened 4 years ago
I can't repro this with 1.15 on GKE.
gcloud container clusters create gpu-test --accelerator type=nvidia-tesla-k80,count=1 --zone=us-central1-c --num-nodes=1 --cluster-version=1.15
Then:
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
daemonset.apps/nvidia-driver-installer created
Then I scaled instances from 1 to 2, and saw that the driver is running fine.
$ ka get po | grep nvidia
kube-system nvidia-driver-installer-c8jcm 1/1 Running 0 3m34s
kube-system nvidia-driver-installer-hrghh 1/1 Running 0 12m
kube-system nvidia-gpu-device-plugin-6jql6 1/1 Running 0 3m34s
kube-system nvidia-gpu-device-plugin-fdzf7 1/1 Running 0 13m
Are you still seeing this? If so, can you please provide repro steps?
We are unable to locate the exact root cause since the repro steps are missing here. A potential fix on the gRPC side has been submitted in https://github.com/GoogleCloudPlatform/container-engine-accelerators/pull/135.
This can be reproduced under Ubuntu but not in COS. Just rolled new cluster with Ubuntu and got this error.
Can you provide the GKE version and the OS where you reproduced this error?
I used the rapid channel, Kubernetes 1.19. I'm not sure of the Ubuntu version; I just selected it from the drop-down in the create-cluster wizard. The GPU I used was an NVIDIA T4.
I encountered the same issue on:
What's weird is that /root/home doesn't even exist on the node, so I have no idea why the pod fails to create the link. I tried updating to the latest version of the daemonset, but it didn't help.
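The failing link creation reads like an ordinary ln(1) collision: if a previous installer run left a file or directory at the link path, a non-forced symlink attempt fails with "File exists" and the init container exits non-zero. A minimal, self-contained sketch of that behavior (the paths are invented for illustration, not the exact layout on an affected node):

```shell
#!/bin/sh
# Hedged sketch: a generic reproduction of the "symlink already exists"
# failure mode that matches the installer's crash-loop symptom.
# Paths are illustrative, not the exact ones from the node.
tmp=$(mktemp -d)
mkdir -p "$tmp/home/kubernetes/bin"
touch "$tmp/home/kubernetes/bin/nvidia"   # stale entry left by an earlier run
# Without -f, ln refuses to create a link over an existing destination:
if ln -s /usr/local/nvidia "$tmp/home/kubernetes/bin/nvidia" 2>/dev/null; then
  result="link created"
else
  result="link failed: destination exists"
fi
echo "$result"
# Forcing the link (-f) replaces the stale entry and succeeds:
ln -sf /usr/local/nvidia "$tmp/home/kubernetes/bin/nvidia"
target=$(readlink "$tmp/home/kubernetes/bin/nvidia")
echo "after -f: $target"
rm -rf "$tmp"
```

If something like this is what the installer hits, it would also explain why manually deleting the leftover directory lets a subsequent pod succeed.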
Getting the same issue here on:
I can't get any logs out of the pod, but describing it shows the error:
Controlled By:  DaemonSet/nvidia-driver-installer-ubuntu
Init Containers:
  nvidia-driver-installer:
    Container ID:   docker://3f92ca08c6a68900de40a0fc98b236240722191b06b1eaf00fa8ba67be04ffbe
    Image:          gke-nvidia-installer:fixed
    Image ID:       docker://sha256:50944645cd9975d5b2c904353e1ab5b2cdd41f4e959aefbe7b2624d0b8c43652
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 15 Dec 2021 15:48:43 -0500
      Finished:     Wed, 15 Dec 2021 15:48:55 -0500
    Ready:          False
    Restart Count:  1950
    Requests:
      cpu:  150m
    Environment:  <none>
    Mounts:
      /boot from boot (rw)
      /dev from dev (rw)
      /root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m9gck (ro)
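A restart count that high is an easy signal to filter on. As a sketch, the RESTARTS column of `kubectl get pods -n kube-system` can be screened with awk to flag crash-looping installer pods; the canned sample below stands in for live cluster output:

```shell
#!/bin/sh
# Hedged sketch: flag crash-looping installer pods from `kubectl get pods`
# style output. The variable below holds canned sample data, not live output.
pods='NAME                               READY  STATUS            RESTARTS  AGE
nvidia-driver-installer-c8jcm      1/1    Running           0         3m34s
nvidia-driver-installer-ubuntu-x1  0/1    CrashLoopBackOff  1950      7d'
# Print any installer pod whose restart count exceeds a small threshold:
flagged=$(printf '%s\n' "$pods" |
  awk '/nvidia-driver-installer/ && $4+0 > 5 {print $1 " restarts=" $4}')
echo "$flagged"
```

Piping real `kubectl get pods -n kube-system` output through the same awk filter gives a quick per-node view of which installers are stuck.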
Just for your information, there is a workaround if you want to fix it. SSH into the node and run:
After that, restart the installer pod and it will run successfully.
We've been using the nvidia-driver-installer on Ubuntu node groups via GKE v1.15 per the official How-to GPU instructions specified here.
The daemonset deployed via daemonset-preloaded.yaml appeared to work correctly for some time; however, we started noticing issues last Friday when new nodes were added to the node group by cluster autoscaling. The nvidia-driver-installer daemonset pods scheduled to these new nodes began to crash loop, as their init containers were exiting with non-zero exit codes. Upon examining pod logs, it appears that the failed pods contain the following lines as their last output before exiting.
See here for full log output from one of the failed pods.
I've logged into one of the nodes and manually removed the /root/home/kubernetes/bin/nvidia folder (which is presumably created by the very first nvidia-driver-installer pod scheduled to a node when it comes up), but the folder reappears and the daemonset pods continue to crash loop. Nodes whose daemonset pods are in this state don't have the drivers correctly installed, and jobs that require them fail to import CUDA due to driver issues.

We've been experiencing this issue for 4 days now with nodes that receive live production traffic. Not every node that scales up experiences this problem, but most do. If a node comes up and its nvidia-driver-installer pod begins to crash, we've had no luck bringing it out of that state. Instead, we've manually marked the node as unschedulable and brought it down, hoping the next one to come up won't experience the same problem.
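The "mark unschedulable and bring it down" step maps onto kubectl's cordon/drain/delete sequence. A sketch with a placeholder node name and a dry-run default (set KUBECTL=kubectl to execute against a real cluster); the drain flags shown are the usual ones for nodes running daemonset pods:

```shell
#!/bin/sh
# Hedged sketch of the manual remediation described above: cordon the bad
# node, drain it, then delete it so the autoscaler brings up a replacement.
# KUBECTL defaults to a dry-run echo; NODE is a placeholder name.
KUBECTL=${KUBECTL:-echo kubectl}
NODE=${NODE:-gke-gpu-pool-bad-node}
plan=$(
  $KUBECTL cordon "$NODE"
  $KUBECTL drain "$NODE" --ignore-daemonsets --delete-emptydir-data
  $KUBECTL delete node "$NODE"
)
printf '%s\n' "$plan"
```

Deleting the node only buys a replacement VM; as noted above, the replacement may hit the same crash loop, so this is mitigation rather than a fix.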
From our perspective, nothing has changed in our cluster configuration, node group configuration, or K8s manifests that would cause this issue to start occurring. We did experience something similar in mid-December, but that episode resolved itself within a few hours and we didn't think much of it. I'm happy to provide more logs or detailed information about the errors upon request!
Any thoughts about what could be causing this?