GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Apache License 2.0

nvidia-driver-installer crash loop during GKE scale ups #132

Open brannondorsey opened 4 years ago

brannondorsey commented 4 years ago

We've been using the nvidia-driver-installer on Ubuntu node groups in GKE v1.15, per the official How-to GPU instructions here:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
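
After applying, a quick sanity check that the DaemonSet rolled out looks roughly like this (the DaemonSet name is assumed from the manifest and may differ, e.g. nvidia-driver-installer-ubuntu):

# Confirm the installer DaemonSet is rolled out and its pods are Running on each GPU node
kubectl -n kube-system rollout status daemonset/nvidia-driver-installer
kubectl -n kube-system get pods -o wide | grep nvidia-driver-installer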

The daemonset deployed via daemonset-preloaded.yaml appeared to work correctly for some time. However, we started noticing issues last Friday when new nodes were added to the node group via cluster autoscaling: the nvidia-driver-installer daemonset pods scheduled to those new nodes began to crash loop, as their initContainers were exiting with non-zero exit codes.

Upon examining pod logs, it appears that the failed pods contain the following lines as their last output before exiting.

Verifying Nvidia installation... DONE. 
ln: /root/home/kubernetes/bin/nvidia: cannot overwrite directory

See here for full log output from one of the failed pods.
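
For completeness, the error above is emitted by an init container, so the relevant logs have to be pulled from that container explicitly; a minimal sketch (the pod name is a placeholder, and the init container name is assumed from the manifest):

# Logs from the current attempt of the failing init container
kubectl -n kube-system logs <installer-pod-name> -c nvidia-driver-installer
# Logs from the previous, crashed attempt
kubectl -n kube-system logs <installer-pod-name> -c nvidia-driver-installer --previous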

I've logged into one of the nodes and manually removed the /root/home/kubernetes/bin/nvidia folder (which is presumably created by the very first nvidia-driver-installer pod scheduled to a node when it comes up), but the folder re-appears and the daemonset pods continue to crash loop. Nodes whose daemonset pods are in this state don't have the drivers correctly installed, and jobs that require them fail to import CUDA due to driver issues.
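
For context, the path above is the view from inside the installer container, which (per the manifest) mounts the node's root filesystem at /root; on the node itself it corresponds to /home/kubernetes/bin/nvidia. The manual cleanup was roughly the following (node name and zone are placeholders):

# SSH into the affected GKE node
gcloud compute ssh <node-name> --zone=<zone>
# On the node: remove the conflicting directory that the ln step trips over
sudo rm -rf /home/kubernetes/bin/nvidia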

We've been experiencing this issue for 4 days now with nodes that receive live production traffic. Not every node that scales up experiences this problem, but most do. If a node comes up and its nvidia-driver-installer pod begins to crash, we've had no luck bringing it out of that state. Instead, we've manually marked the node as unschedulable and brought it down, hoping the next one to come up won't experience the same problem.
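
The drain-and-replace step is nothing special; roughly (node name is a placeholder):

# Mark the broken node unschedulable, evict its workloads, and remove it
# so the autoscaler / managed instance group brings up a replacement
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets
kubectl delete node <node-name>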

From our perspective, nothing has changed in our cluster configuration, node group configuration, or K8s manifests that would cause this issue to start occurring. We did experience something similar in mid-December, but it resolved itself within a few hours and we didn't think much of it. I'm happy to provide more logs or detailed information about the errors upon request!

Any thoughts about what could be causing this?

karan commented 4 years ago

I can't repro this with 1.15 on GKE.

gcloud container clusters create gpu-test --accelerator type=nvidia-tesla-k80,count=1 --zone=us-central1-c --num-nodes=1 --cluster-version=1.15

Then:

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
daemonset.apps/nvidia-driver-installer created

Then I scaled instances from 1 to 2, and saw that the driver is running fine.

$ ka get po | grep nvidia
kube-system   nvidia-driver-installer-c8jcm                               1/1     Running   0          3m34s
kube-system   nvidia-driver-installer-hrghh                               1/1     Running   0          12m
kube-system   nvidia-gpu-device-plugin-6jql6                              1/1     Running   0          3m34s
kube-system   nvidia-gpu-device-plugin-fdzf7                              1/1     Running   0          13m
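
For reference, the scale-up was just a node pool resize, something like this (pool name assumed to be the default):

# Resize the test cluster created above from 1 node to 2
gcloud container clusters resize gpu-test --node-pool=default-pool --num-nodes=2 --zone=us-central1-c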

Are you still seeing this? If so, can you please provide repro steps?

tangenti commented 4 years ago

We are unable to locate the exact root cause since repro steps are missing here. A potential fix on the gRPC side has been submitted in https://github.com/GoogleCloudPlatform/container-engine-accelerators/pull/135.

adityapatadia commented 3 years ago

This can be reproduced on Ubuntu but not on COS. I just rolled a new cluster with Ubuntu and got this error.

ruiwen-zhao commented 3 years ago

This can be reproduced under Ubuntu but not in COS. Just rolled new cluster with Ubuntu and got this error.

Can you provide the GKE version and the OS where you reproduced this error?

adityapatadia commented 3 years ago

I used the rapid channel with Kubernetes 1.19. I'm not sure of the Ubuntu version; I just selected it from the drop-down in the create-cluster wizard. The GPU I used was an NVIDIA T4.

ClementGautier commented 3 years ago

I encountered the same issue on:

What's weird is that /root/home doesn't even exist on the node, so I have no idea why the pod is failing to create the link. I tried updating to the latest version of the daemonset and it didn't help.
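
One note on that path: /root/home/... is how it looks from inside the installer container, which mounts the node's root filesystem at /root, so the equivalent path on the node itself is /home/kubernetes/bin/nvidia. A quick check on the node would be something like:

# On the node, the container path /root/home/kubernetes/bin/nvidia corresponds to:
ls -ld /home/kubernetes/bin/nvidia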

rarestg commented 2 years ago

Getting the same issue here on:

I can't get any logs out of the pod but describing it shows the error:

Controlled By:  DaemonSet/nvidia-driver-installer-ubuntu
Init Containers:
  nvidia-driver-installer:
    Container ID:   docker://3f92ca08c6a68900de40a0fc98b236240722191b06b1eaf00fa8ba67be04ffbe
    Image:          gke-nvidia-installer:fixed
    Image ID:       docker://sha256:50944645cd9975d5b2c904353e1ab5b2cdd41f4e959aefbe7b2624d0b8c43652
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 15 Dec 2021 15:48:43 -0500
      Finished:     Wed, 15 Dec 2021 15:48:55 -0500
    Ready:          False
    Restart Count:  1950
    Requests:
      cpu:        150m
    Environment:  <none>
    Mounts:
      /boot from boot (rw)
      /dev from dev (rw)
      /root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m9gck (ro)
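
For reference, output like the above comes from describing the failing pod (pod name is a placeholder):

kubectl -n kube-system describe pod <installer-pod-name>
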
omer-dayan commented 2 years ago

Just for your information, there is a workaround if you want to fix it. SSH into the node and run:

[screenshot of the workaround command]

After that, restart the installer pod and it will run successfully.