GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Apache License 2.0

Nvidia Driver Public Bucket returning 403 - breaking ALL driver installation #356

Closed: bigbitbus closed this issue 6 months ago

bigbitbus commented 6 months ago

The storage location

https://storage.googleapis.com/nvidia-drivers-us-public/tesla/535.129.03/NVIDIA-Linux-x86_64-535.129.03.run

is returning 403/Forbidden.

So all driver installation via the DaemonSets is failing.

Does anyone know who owns the permissions for the bucket "nvidia-drivers-us-public", and whether/why they were changed earlier today?
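
For reference, the failure is reproducible outside the daemonset with a plain HTTP request against the same URL:

    # HEAD request against the driver object; prints only the status line (currently a 403)
    curl -sI https://storage.googleapis.com/nvidia-drivers-us-public/tesla/535.129.03/NVIDIA-Linux-x86_64-535.129.03.run | head -n 1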

praveenperera commented 6 months ago

I'm also seeing this; all new COS GPU nodes are broken. Ubuntu ones are working.

abrowne2 commented 6 months ago

Seeing this as well. Would love an update.

orktes commented 6 months ago

Looks like Google has changed permissions on the bucket where the NVIDIA files originate. Even the GRID driver example links in their docs are now returning 403: https://cloud.google.com/compute/docs/gpus/install-grid-drivers.

praveenperera commented 6 months ago

@orktes yes, their auto-install also doesn't work because it's looking in the same bucket.

praveenperera commented 6 months ago

Are you all trying Tesla T4s?

bigbitbus commented 6 months ago

This issue is manifesting for T4s and L4s at least. Moreover, Kubernetes keeps trying to scale up the node pools because it can't find usable GPU machines, so watch your costs balloon $$$$$.
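
If you want to stop the autoscaler from churning out unusable GPU nodes in the meantime, something like the following should work (a rough sketch only; the cluster, pool, and zone names are placeholders):

    # Turn off autoscaling on the affected GPU pool
    gcloud container clusters update my-cluster --zone us-central1-c \
      --node-pool gpu-pool --no-enable-autoscaling
    # Optionally scale the pool down while the bucket is broken
    gcloud container clusters resize my-cluster --zone us-central1-c \
      --node-pool gpu-pool --num-nodes 0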

orktes commented 6 months ago

We got an update from GCP support. After investigating, they acknowledged it as a bug and are working on a fix. We haven't received an ETA yet.

orktes commented 6 months ago

It also seems to affect at least both the US and EU buckets.

e.g. https://storage.googleapis.com/nvidia-drivers-eu-public/tesla/525.125.06/NVIDIA-Linux-x86_64-525.125.06.run is also returning 403.

orktes commented 6 months ago

We noticed that at least some older driver versions (470.223.02) are still working. Sadly this was not a feasible workaround for us because of the CUDA version we are using, but it might be worth trying for others.

dbason commented 6 months ago

@orktes we're currently using the cos-nvidia-installer:fixed image as the installer. Is there a specific image tag to get the installer for the old drivers?

orktes commented 6 months ago

> @orktes we're currently using the cos-nvidia-installer:fixed image as the installer. Is there a specific image tag to get the installer for the old drivers?

You can list the available driver versions and install a specific one with:

/cos-gpu-installer list
/cos-gpu-installer install -version VERSION_HERE

So, for example, here it's just taking the latest (https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/91075bc917df4dafb47ba7d903d57f48da5c932c/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml#L120), but you can also provide a specific version.
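
For a quick manual test you can also run the installer directly on an affected node over SSH. A rough sketch (the node and zone names are placeholders, and I'm assuming cos-extensions forwards the flags after -- to cos-gpu-installer, as in the invocation shown later in this thread):

    # SSH to a GPU node in the broken pool
    gcloud compute ssh gke-my-cluster-gpu-node-1234 --zone us-central1-c
    # On the node: list the driver versions the installer knows about
    sudo cos-extensions list
    # Install a pinned version instead of "latest" (470.223.02 was reported working above)
    sudo cos-extensions install gpu -- -version 470.223.02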

Speissi commented 6 months ago

@dbason you might want to list all the versions available for your cluster with /cos-gpu-installer list first. It seems that the older driver versions (470) are hosted in a different GCS bucket, so they still work.

bigbitbus commented 6 months ago

This driver installation works (but it's the 470 driver, so no L4 support):

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
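
A quick way to confirm the install took (the kube-system label selector here is an assumption based on the daemonset's name):

    # Installer pods should come up without crash-looping
    kubectl -n kube-system get pods -l k8s-app=nvidia-driver-installer
    # Each GPU node should report a non-empty nvidia.com/gpu allocatable count
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'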

orktes commented 6 months ago

Looks like the P4, P100, V100, and K80 drivers are unaffected.

dbason commented 6 months ago

We tried rolling back to version 535.104.12 which was working earlier today but that is no longer working either. Based on the previous replies it looks like all the 535 drivers are in the same bucket which has had its permissions changed.

dbason commented 6 months ago

This is now a known outage with GCP

marwanad commented 6 months ago

Tried a workaround to at least unblock the installer piece, but no dice on COS:

cos-extensions install gpu -- -allow-unsigned-driver -nvidia-installer-url https://us.download.nvidia.com/tesla/525.125.06/NVIDIA-Linux-x86_64-525.125.06.run
I0312 20:38:07.568070 2152802 install.go:248] Installing GPU driver from "https://us.download.nvidia.com/tesla/525.125.06/NVIDIA-Linux-x86_64-525.125.06.run"
I0312 20:38:07.568512 2152802 cos.go:31] Checking kernel module signing.
I0312 20:38:07.568548 2152802 installer.go:128] Configuring driver installation directories
I0312 20:38:07.597459 2152802 utils.go:88] Downloading toolchain_env from https://storage.googleapis.com/cos-tools-asia/17800.66.78/lakitu/toolchain_env
I0312 20:38:07.694461 2152802 cos.go:73] Installing the toolchain
I0312 20:38:07.694514 2152802 utils.go:88] Downloading toolchain.tar.xz from https://storage.googleapis.com/cos-tools-asia/17800.66.78/lakitu/toolchain.tar.xz
I0312 20:38:12.138665 2152802 cos.go:94] Unpacking toolchain...
I0312 20:40:02.064679 2152802 cos.go:99] Done unpacking toolchain
I0312 20:40:02.064919 2152802 utils.go:88] Downloading kernel-headers.tgz from https://storage.googleapis.com/cos-tools-asia/17800.66.78/lakitu/kernel-headers.tgz
I0312 20:40:02.240067 2152802 cos.go:109] Unpacking kernel headers...
I0312 20:40:03.226125 2152802 cos.go:113] Done unpacking kernel headers
I0312 20:40:03.226263 2152802 utils.go:88] Downloading Unofficial GPU driver installer from https://us.download.nvidia.com/tesla/525.125.06/NVIDIA-Linux-x86_64-525.125.06.run
I0312 20:40:05.315577 2152802 installer.go:325] Running GPU driver installer
I0312 20:40:35.940491 2152802 installer.go:165] Extracting precompiled artifacts...
E0312 20:40:35.940648 2152802 install.go:457] failed to run GPU driver installer: failed to extract precompiled artifacts: failed to read "/tmp/extract/kernel/precompiled": open /tmp/extract/kernel/precompiled: no such file or directory

bigbitbus commented 6 months ago

Latest from GCP support (see the section on partial mitigation): [screenshot of the support update attached]

tab1293 commented 6 months ago

Seeing this issue as well. Google support informed me that this is a P0 issue, so hopefully it will be resolved soon.

hatemosphere commented 6 months ago

The workaround is to simply migrate the node pool to Ubuntu (UBUNTU_CONTAINERD). For L4s you'll have to disable automatic driver installation and install the 525 driver manually: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#ubuntu
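
Roughly something like this (a sketch only; the cluster, pool, zone, and machine-type values are placeholders, and the gpu-driver-version option in --accelerator needs a reasonably recent gcloud):

    # Create an Ubuntu node pool with automatic driver installation disabled
    gcloud container node-pools create gpu-ubuntu-pool \
      --cluster my-cluster --zone us-central1-c \
      --machine-type g2-standard-8 \
      --image-type UBUNTU_CONTAINERD \
      --accelerator type=nvidia-l4,count=1,gpu-driver-version=disabled \
      --num-nodes 1
    # Then install the driver yourself with the Ubuntu daemonset from this repo
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml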

praveenperera commented 6 months ago

> The workaround is to simply migrate the node pool to Ubuntu (UBUNTU_CONTAINERD). For L4s you'll have to disable automatic driver installation and install the 525 driver manually: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#ubuntu

Didn't work for me; the available driver is an older 525 release. Not sure if that's why my workload didn't work.

davidhu2000 commented 6 months ago

> The workaround is to simply migrate the node pool to Ubuntu (UBUNTU_CONTAINERD). For L4s you'll have to disable automatic driver installation and install the 525 driver manually: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#ubuntu

Also note that you are stuck with COS if you're using GKE Autopilot.

rvadim commented 6 months ago

You can upload your own .run file to a public bucket in your GCP project, following the structure the installer expects: https://cos.googlesource.com/cos/tools/+/refs/heads/master/src/cmd/cos_gpu_installer/internal/installer/installer.go#41

Then modify the code, build it with go build -o cos-gpu-installer src/cmd/cos_gpu_installer/main.go, and build an image like:

$ cat Dockerfile 
FROM gcr.io/cos-cloud/cos-gpu-installer@sha256:8b8247d9d16ee92e13a33c6206c1727ff97e7dfa52d5732299493e3198bd4719

COPY cos-gpu-installer /cos-gpu-installer

Then push it and use it in the daemonset. EDIT: Works with L4, not sure about others.
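
For reference, the steps look roughly like this (a sketch; the bucket, project, and image names are placeholders, and the object layout has to match what installer.go expects):

    # Host your own copy of the driver in a bucket you control
    gsutil mb -p my-project gs://my-nvidia-drivers
    gsutil cp NVIDIA-Linux-x86_64-535.129.03.run gs://my-nvidia-drivers/tesla/535.129.03/
    gsutil iam ch allUsers:objectViewer gs://my-nvidia-drivers
    # Build the patched installer and wrap it in an image based on the upstream one
    go build -o cos-gpu-installer src/cmd/cos_gpu_installer/main.go
    docker build -t gcr.io/my-project/cos-gpu-installer:patched .
    docker push gcr.io/my-project/cos-gpu-installer:patched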

abrowne2 commented 6 months ago

> You can upload your own .run file to a public bucket in your GCP project, following the structure the installer expects: https://cos.googlesource.com/cos/tools/+/refs/heads/master/src/cmd/cos_gpu_installer/internal/installer/installer.go#41
>
> Then modify the code, build it with go build -o cos-gpu-installer src/cmd/cos_gpu_installer/main.go, and build an image like:
>
> $ cat Dockerfile
> FROM gcr.io/cos-cloud/cos-gpu-installer@sha256:8b8247d9d16ee92e13a33c6206c1727ff97e7dfa52d5732299493e3198bd4719
>
> COPY cos-gpu-installer /cos-gpu-installer
>
> Then push it and use it in the daemonset. EDIT: Works with L4, not sure about others.

Could do this, but what feels a bit better would be to run:

/cos-gpu-installer -allow-unsigned-driver -nvidia-installer-url=<FILE_DOMAIN>/NVIDIA-Linux-x86_64-525.125.06.run

Offhand, I don't have a proper .run file (the NVIDIA-hosted files don't seem to extract correctly); if someone has a 525 .run file hosted, that would be helpful.

yeldarby commented 6 months ago

It's back and working; this loads https://storage.googleapis.com/nvidia-drivers-us-public/tesla/535.129.03/NVIDIA-Linux-x86_64-535.129.03.run

bigbitbus commented 6 months ago

This is the resolution announcement email from GCP; they will publish an analysis of the outage later.

Thanks y'all for the productive ideas and workarounds, and moral support!

Closing the issue now that it's fixed.

[screenshot of the resolution email attached]

happy-machine commented 3 months ago

Having this issue with both latest and default on an L4 in us-central1-c.

Container statuses for the nvidia-gpu-device-plugin pod:

check-socket: image gke.gcr.io/gke-distroless/bash@sha256:b7ef809eea4cc43476ad4c2628e3eb54faf2aecd08ee41cb4a432f1a383c7d92, ready false, state PodInitializing, init true
nvidia-driver-installer: image cos-nvidia-installer:fixed, ready false, state Running, init true
nvidia-gpu-device-plugin: image gke.gcr.io/nvidia-gpu-device-plugin@sha256:13574d1d59e5063705064a30f22d93eb3994e97fbd850a5a19ffc33b48eb25cc, ready false, state PodInitializing, init false
partition-gpus: image gke.gcr.io/nvidia-partition-gpu@sha256:6b63ed9aae6061b9062f3898638c8018816d9d5916276a6c6abdb93115bfe3e3, ready false, state PodInitializing, init true

Here are the logs for the driver installer

Logs for kube-system/nvidia-gpu-device-plugin-small-cos-wt9cb, container nvidia-driver-installer:

[curl download progress omitted]
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
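
That last message usually means GPU driver auto-installation is disabled for the node pool, in which case the driver daemonset has to be applied manually. A minimal sketch, using the COS manifest referenced earlier in this thread:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml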

agam commented 3 weeks ago

Seeing the same as above for a T4:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 100   720  100   720    0     0   113k      0 --:--:-- --:--:-- --:--:--  117k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.

This is blocking the usage of T4 nodes with Node-Auto-Provisioning, because the nvidia.com/gpu resource is not registered, and pods cannot be scheduled on them.
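
Quick checks that show the symptom:

    # Check whether any node has registered nvidia.com/gpu in capacity/allocatable (empty output means none has)
    kubectl describe nodes | grep -i "nvidia.com/gpu"
    # The GPU pods stay Pending with FailedScheduling events
    kubectl get events -A --field-selector reason=FailedScheduling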