Closed by bigbitbus 6 months ago
I'm also seeing this, all new COS GPUs are broken. Ubuntu ones are working.
Also seeing this as well. Would love an update
Looks like Google has changed permissions on the bucket where the NVIDIA files originate. Even the GRID driver example links in their docs are now returning 403: https://cloud.google.com/compute/docs/gpus/install-grid-drivers.
@orktes yes, their auto-install also doesn't work because it's looking in the same bucket.
Are you all trying Tesla T4s?
This issue is manifesting for T4s and L4s at least. Moreover, Kubernetes keeps trying to scale up the node pools because it doesn't find usable GPU machines; watch your costs balloon.
We got an update from GCP support. After investigation they acknowledged it as a bug and are working to fix it. We didn't receive any ETA yet.
Also seems to affect both US and EU buckets at least.
e.g. https://storage.googleapis.com/nvidia-drivers-eu-public/tesla/525.125.06/NVIDIA-Linux-x86_64-525.125.06.run is also returning 403.
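A quick way to check whether the bucket objects are reachable, without downloading the full driver, is to ask curl for just the HTTP status code (the URL here is the one quoted above; it returned 403 during the outage):

```shell
# HEAD request; print only the HTTP status code, discard headers.
# Prints 403 while the bucket permissions are broken, 200 once fixed.
url="https://storage.googleapis.com/nvidia-drivers-eu-public/tesla/525.125.06/NVIDIA-Linux-x86_64-525.125.06.run"
status=$(curl -s -o /dev/null -w '%{http_code}' -I --max-time 10 "$url" || true)
echo "status=$status"
```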
We noticed that some older versions of the drivers (470.223.02 at least) are working. Sadly this was not a feasible workaround for us due to the CUDA version we are using, but it might be worth trying for others.
@orktes we're currently using the cos-nvidia-installer:fixed image as the installer. Is there a specific image tag to get the installer for the old drivers?
/cos-gpu-installer list
/cos-gpu-installer install -version VERSION_HERE
So, for example, here it's just taking the latest — https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/91075bc917df4dafb47ba7d903d57f48da5c932c/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml#L120 — but you can also provide a static version.
@dbason you might want to list all the versions available for your cluster with /cos-gpu-installer list
first. It seems that the older driver versions (470) are hosted in a different GCS bucket, so they work.
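For reference, pinning a version in the installer daemonset looks roughly like this. This is a sketch only: the container name and image match what's quoted in this thread, but the exact command/flags vary between revisions of the manifest, so verify against the daemonset you actually deploy:

```yaml
# Sketch: pin a specific driver version instead of taking the latest.
# Flags and structure may differ per daemonset revision; treat as illustrative.
containers:
  - name: nvidia-driver-installer
    image: cos-nvidia-installer:fixed
    command:
      - /cos-gpu-installer
      - install
      - -version=470.223.02
```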
This driver installation works (but it's a 470 driver, so no L4):
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
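After applying that manifest, you can confirm the installer actually succeeded and that nodes register the GPU resource. The label selector below matches the stock GKE daemonset, but may differ in your cluster:

```shell
# Tail the installer logs (selector/container name follow the stock manifest).
kubectl -n kube-system logs -l k8s-app=nvidia-driver-installer \
  -c nvidia-driver-installer --tail=20

# List nodes with their allocatable nvidia.com/gpu count; empty means the
# driver has not come up and pods requesting GPUs will stay unschedulable.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```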
Looks like P4, P100, V100, and K80 drivers are unaffected.
We tried rolling back to version 535.104.12, which was working earlier today, but that is no longer working either. Based on the previous replies it looks like all the 535 drivers are in the same bucket, which has had its permissions changed.
This is now a known outage with GCP
Trying some workarounds to at least unblock the installer piece, but no dice on COS:
cos-extensions install gpu -- -allow-unsigned-driver -nvidia-installer-url https://us.download.nvidia.com/tesla/525.125.06/NVIDIA-Linux-x86_64-525.125.06.run
I0312 20:38:07.568070 2152802 install.go:248] Installing GPU driver from "https://us.download.nvidia.com/tesla/525.125.06/NVIDIA-Linux-x86_64-525.125.06.run"
I0312 20:38:07.568512 2152802 cos.go:31] Checking kernel module signing.
I0312 20:38:07.568548 2152802 installer.go:128] Configuring driver installation directories
I0312 20:38:07.597459 2152802 utils.go:88] Downloading toolchain_env from https://storage.googleapis.com/cos-tools-asia/17800.66.78/lakitu/toolchain_env
I0312 20:38:07.694461 2152802 cos.go:73] Installing the toolchain
I0312 20:38:07.694514 2152802 utils.go:88] Downloading toolchain.tar.xz from https://storage.googleapis.com/cos-tools-asia/17800.66.78/lakitu/toolchain.tar.xz
I0312 20:38:12.138665 2152802 cos.go:94] Unpacking toolchain...
I0312 20:40:02.064679 2152802 cos.go:99] Done unpacking toolchain
I0312 20:40:02.064919 2152802 utils.go:88] Downloading kernel-headers.tgz from https://storage.googleapis.com/cos-tools-asia/17800.66.78/lakitu/kernel-headers.tgz
I0312 20:40:02.240067 2152802 cos.go:109] Unpacking kernel headers...
I0312 20:40:03.226125 2152802 cos.go:113] Done unpacking kernel headers
I0312 20:40:03.226263 2152802 utils.go:88] Downloading Unofficial GPU driver installer from https://us.download.nvidia.com/tesla/525.125.06/NVIDIA-Linux-x86_64-525.125.06.run
I0312 20:40:05.315577 2152802 installer.go:325] Running GPU driver installer
I0312 20:40:35.940491 2152802 installer.go:165] Extracting precompiled artifacts...
E0312 20:40:35.940648 2152802 install.go:457] failed to run GPU driver installer: failed to extract precompiled artifacts: failed to read "/tmp/extract/kernel/precompiled": open /tmp/extract/kernel/precompiled: no such file or directory
Latest from GCP support (see section on partial mitigation)
Seeing this issue as well. Google support informed me that this is a P0 issue, so hopefully it is resolved soon.
The workaround is to simply migrate the node pool to Ubuntu (UBUNTU_CONTAINERD). For L4s you'll have to disable automatic driver install and install 525 manually: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#ubuntu
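Creating the replacement Ubuntu pool looks roughly like this. Cluster name, zone, and machine/accelerator types are placeholders for your setup (L4s require G2 machine types); driver install is then done manually per the linked doc:

```shell
# Sketch: replacement GPU node pool on Ubuntu with containerd.
# All names and sizes below are placeholders; adjust for your cluster.
gcloud container node-pools create gpu-ubuntu \
  --cluster=YOUR_CLUSTER \
  --zone=us-central1-c \
  --image-type=UBUNTU_CONTAINERD \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=1
```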
Didn't work for me; the available driver is a lower 525 release. Not sure if that's why my workload didn't work.
Also note that you are stuck with COS if you're using GKE Autopilot.
You can upload your own .run file to a public bucket in your GCP project, using the path structure expected by the installer code: https://cos.googlesource.com/cos/tools/+/refs/heads/master/src/cmd/cos_gpu_installer/internal/installer/installer.go#41
Then modify the code and build it with:
go build -o cos-gpu-installer src/cmd/cos_gpu_installer/main.go
Then build an image like:
$ cat Dockerfile
FROM gcr.io/cos-cloud/cos-gpu-installer@sha256:8b8247d9d16ee92e13a33c6206c1727ff97e7dfa52d5732299493e3198bd4719
COPY cos-gpu-installer /cos-gpu-installer
Push it and use it in the daemonset. EDIT: Works with L4, not sure about others.
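The upload step can be sketched as below. This assumes the bucket layout mirrors the public one seen in this thread (tesla/&lt;version&gt;/NVIDIA-Linux-x86_64-&lt;version&gt;.run); the bucket name is a placeholder, and making a bucket world-readable has obvious security implications, so use a throwaway bucket:

```shell
# Sketch only: host your own driver .run file with the expected path layout.
VERSION=525.125.06
gsutil mb gs://YOUR_PUBLIC_BUCKET
gsutil cp "NVIDIA-Linux-x86_64-${VERSION}.run" \
  "gs://YOUR_PUBLIC_BUCKET/tesla/${VERSION}/NVIDIA-Linux-x86_64-${VERSION}.run"
# Grant public read access so the installer on the node can fetch it.
gsutil iam ch allUsers:objectViewer gs://YOUR_PUBLIC_BUCKET
```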
Could do this, but what feels a bit better would be to run:
/cos-gpu-installer -allow-unsigned-driver -nvidia-installer-url=<FILE_DOMAIN>/NVIDIA-Linux-x86_64-525.125.06.run
Offhand I don't have a proper .run file (the NVIDIA-hosted files don't seem to extract right); if someone has a 525 .run file hosted, that'd be helpful.
It's back and working; this now loads: https://storage.googleapis.com/nvidia-drivers-us-public/tesla/535.129.03/NVIDIA-Linux-x86_64-535.129.03.run
This is the resolution announcement email from GCP; they will publish an analysis of the outage later.
Thanks y'all for the productive ideas and workarounds, and moral support!
Closing the issue now that it's fixed.
Having this issue with both latest and default on L4 in us-central1-c.
NAME                       IMAGE                                                                                                        READY  STATE            INIT
check-socket               gke.gcr.io/gke-distroless/bash@sha256:b7ef809eea4cc43476ad4c2628e3eb54faf2aecd08ee41cb4a432f1a383c7d92       false  PodInitializing  true
nvidia-driver-installer    cos-nvidia-installer:fixed                                                                                   false  Running          true
nvidia-gpu-device-plugin   gke.gcr.io/nvidia-gpu-device-plugin@sha256:13574d1d59e5063705064a30f22d93eb3994e97fbd850a5a19ffc33b48eb25cc  false  PodInitializing  false
partition-gpus             gke.gcr.io/nvidia-partition-gpu@sha256:6b63ed9aae6061b9062f3898638c8018816d9d5916276a6c6abdb93115bfe3e3      false  PodInitializing  true
Here are the logs for the driver installer:
Logs (kube-system/nvidia-gpu-device-plugin-small-cos-wt9cb, container nvidia-driver-installer):
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   645  100   645    0     0  27519      0 --:--:-- --:--:-- --:--:-- 28043
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
Seeing the same as above for a T4:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   720  100   720    0     0   113k      0 --:--:-- --:--:-- --:--:--  117k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
This is blocking the usage of T4 nodes with node auto-provisioning, because the nvidia.com/gpu resource is not registered and pods cannot be scheduled on them.
The storage location https://storage.googleapis.com/nvidia-drivers-us-public/tesla/535.129.03/NVIDIA-Linux-x86_64-535.129.03.run is returning 403 Forbidden, so all driver installation via the daemonsets is failing.
Does anyone know who owns the permissions for the "nvidia-drivers-us-public" bucket, and if/why they were changed earlier today?