GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Apache License 2.0

Node Auto-Provisioning failing for certain GPU nodes (T4) #402

Open agam opened 3 weeks ago

agam commented 3 weeks ago

How to re-create

A job that requests nvidia.com/gpu and causes GKE Node Auto-Provisioning to spin up a new node will fail to be scheduled on that node.
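As a concrete repro, something like the following is enough (a minimal sketch: the job name and image are placeholders, and it assumes the standard GKE accelerator node label for T4):

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-repro                               # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
      - name: cuda-test
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # any GPU-capable image works
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1                   # this request is what triggers NAP
EOF

With Node Auto-Provisioning enabled, applying this job triggers a new T4 node to be provisioned.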

Why is this bad

Details on error

The provisioned node has an nvidia-device-plugin pod. That pod has an nvidia-driver-installer init container, which is stuck on startup:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 100   720  100   720    0     0   113k      0 --:--:-- --:--:-- --:--:--  117k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
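For reference, that output is from the init container's logs; it can be pulled with something along these lines (pod name and namespace may differ on your cluster):

kubectl -n kube-system get pods | grep nvidia
kubectl -n kube-system logs <device-plugin-pod> -c nvidia-driver-installer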

As a result, the kubelet never registers the nvidia.com/gpu resource, which means that the job (which triggered the node in the first place!) can't get its pods scheduled on it.
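This is easy to confirm by checking the node's allocatable resources; nvidia.com/gpu never shows up:

kubectl get node <node-name> -o jsonpath='{.status.allocatable}'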

Prior context:

This is based on the issue below, which appears to have regressed (and which I cannot reopen):

https://github.com/GoogleCloudPlatform/container-engine-accelerators/issues/356

agam commented 2 weeks ago

FWIW, I'm now seeing this on A100 nodes too.