GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Apache License 2.0

Node Auto-Provisioning failing for certain GPU nodes (T4) #402

Open agam opened 3 weeks ago

agam commented 3 weeks ago

How to re-create

A job that requests nvidia.com/gpu and causes GKE Node Auto-Provisioning to spin up a new node will fail to be scheduled on that node.
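As a concrete repro, something like the following is enough (a minimal sketch: the job name and image are placeholders, and it assumes the standard GKE accelerator node label for T4):

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-repro                               # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
      - name: cuda-test
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # any GPU-capable image works
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1                   # this request is what triggers NAP
EOF

With Node Auto-Provisioning enabled, applying this job triggers a new T4 node to be provisioned.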

Why is this bad

Details on error

The provisioned node has an nvidia-device-plugin pod. That pod has an nvidia-driver-installer init container, which is stuck on startup:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 100   720  100   720    0     0   113k      0 --:--:-- --:--:-- --:--:--  117k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
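For reference, that output is from the init container's logs; it can be pulled with something along these lines (pod name and namespace may differ on your cluster):

kubectl -n kube-system get pods | grep nvidia
kubectl -n kube-system logs <device-plugin-pod> -c nvidia-driver-installer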

As a result, the kubelet never registers the nvidia.com/gpu resource, which means that the job (which triggered the node in the first place!) can't get its pods scheduled on it.
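This is easy to confirm by checking the node's allocatable resources; nvidia.com/gpu never shows up:

kubectl get node <node-name> -o jsonpath='{.status.allocatable}'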

Prior context:

This is based on the issue below, which appears to have regressed (and which I cannot reopen):

https://github.com/GoogleCloudPlatform/container-engine-accelerators/issues/356

agam commented 2 weeks ago

FWIW, I'm now seeing this on A100 nodes too.