A job that requests nvidia.com/gpu will, if it causes a new node to be spun up by GKE Node Auto-Provisioning, fail to be scheduled on that node.
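A minimal Job of the kind described might look like the following sketch. The name and image are illustrative assumptions; only the nvidia.com/gpu limit (which triggers auto-provisioning) and the T4 node selector matter here:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test              # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        # tells Node Auto-Provisioning which accelerator to provision
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1       # triggers provisioning of a GPU node
```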
Why this is bad
Using GPU nodes with Node Auto-Provisioning in GKE is broken (at least for T4s; I am not sure which other GPU types are affected).
It is strange that such a core elasticity behavior is broken without acknowledgement -- I hope this issue gets attention and results in at least an ETA for a fix.
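For reference, a cluster in this state would have Node Auto-Provisioning enabled with GPU resource limits along these lines (illustrative config file for gcloud's --autoprovisioning-config-file; the specific limit values are assumptions):

```yaml
resourceLimits:
- resourceType: cpu
  maximum: 64
- resourceType: memory
  maximum: 256
- resourceType: nvidia-tesla-t4    # allows NAP to create T4 nodes
  maximum: 4
```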
Details on error
The provisioned node has an nvidia-device-plugin pod.
This pod has an init container, nvidia-driver-installer.
This init container is stuck on startup; its log shows:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   720  100   720    0     0   113k      0 --:--:-- --:--:-- --:--:--  117k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
As a result, the kubelet never registers the nvidia.com/gpu resource, which means that the job (which triggered the node in the first place!) can't get its pods scheduled on it.
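For comparison, once the device plugin registers successfully on a healthy node, the resource shows up in the node's status; on the auto-provisioned nodes described here it never appears. Excerpt of kubectl get node -o yaml output (surrounding values are illustrative):

```yaml
status:
  capacity:
    cpu: "4"              # illustrative values
    memory: 15349184Ki
    nvidia.com/gpu: "1"   # this entry is missing on the broken node,
                          # so the scheduler never places GPU pods there
```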
Prior context:
This is based on the following issue, which is no longer fixed (but which I cannot reopen):
https://github.com/GoogleCloudPlatform/container-engine-accelerators/issues/356