GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Apache License 2.0
214 stars, 151 forks

Driver upgrade is not possible #200

Open adityapatadia opened 3 years ago

adityapatadia commented 3 years ago

We are using this guide to install drivers: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers

The driver version for COS is now pinned, and the installer always installs 450.119.04. We want to upgrade the driver to version 460.32.03 because https://github.com/FFmpeg/nv-codec-headers needs driver version 455.28 or newer.

How can we upgrade the driver version?
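For readers who want to check whether a given driver satisfies a minimum such as the 455.28 requirement mentioned above, here is a minimal Python sketch. The function names are hypothetical, and it assumes driver versions are plain dot-separated integers, which holds for the versions quoted in this thread.

```python
def parse_driver_version(version):
    """Split a dotted driver version such as '450.119.04' into integers."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, required):
    """Return True if the installed driver version is at least the required one.

    Tuples compare element by element, so (460, 32, 3) >= (455, 28) is True.
    """
    return parse_driver_version(installed) >= parse_driver_version(required)


if __name__ == "__main__":
    for candidate in ("450.119.04", "460.32.03"):
        print(candidate, meets_minimum(candidate, "455.28"))
```

Comparing the versions as integer tuples avoids the pitfalls of comparing them as strings or floats.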

andreasjansson commented 2 years ago

@adityapatadia Did you figure out a way to upgrade the driver version? I'm having the same issue.

Endofunctor commented 2 years ago

For anyone who runs into this problem: you can untie your CUDA driver version from your COS version with the following steps:

  1. Download the daemonset-preloaded-latest.yaml manifest.
  2. List the cos-tools bucket with gsutil ls gs://cos-tools/ and look for a COS version newer than the one in your cluster.
  3. List gsutil ls gs://cos-tools/<newer COS version>/extensions/gpu to find the driver version available for that COS version. (Note: if the COS version is 16928.0.0 or newer, this folder does not appear to exist; in that case keep --version=latest in the next step.)
  4. Change the command in daemonset-preloaded-latest.yaml to command: ['/cos-gpu-installer', 'install', '--allow-unsigned-driver', '--version=<driver version found under extensions/gpu, just the number>', '--gcs-download-prefix=<newer COS version>'].

In my case, I had COS version 16108.604.3, which pinned CUDA driver 450.119.04, and I was able to use the 470.82.01 CUDA driver from COS version 16623.102.4.
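The command from step 4 can be sketched as a small Python helper that fills in the two placeholders. This is only an illustration: the function names are hypothetical, the flags are the ones quoted in the steps above, and the value passed to --gcs-download-prefix follows the comment literally (the newer COS version), so the exact format the installer expects is an assumption here.

```python
def build_installer_command(driver_version, cos_version):
    """Return the cos-gpu-installer argument list described in step 4.

    Both arguments are plain strings, e.g. '470.82.01' and '16623.102.4'.
    """
    return [
        "/cos-gpu-installer",
        "install",
        "--allow-unsigned-driver",
        f"--version={driver_version}",
        # Assumption: the prefix is passed exactly as written in the comment.
        f"--gcs-download-prefix={cos_version}",
    ]


def to_yaml_command(args):
    """Render the argument list as a single-line YAML 'command:' entry."""
    quoted = ", ".join(f"'{arg}'" for arg in args)
    return f"command: [{quoted}]"


if __name__ == "__main__":
    args = build_installer_command("470.82.01", "16623.102.4")
    print(to_yaml_command(args))
```

The printed line can be pasted over the existing command entry in daemonset-preloaded-latest.yaml before applying the manifest.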

Other notes: It may be possible to specify the driver URL directly, using versions found under gs://nvidia-drivers-us-public/tesla/. The command would be, for example: command: [ '/cos-gpu-installer', 'install', '--allow-unsigned-driver', '--nvidia-installer-url=https://storage.googleapis.com/nvidia-drivers-us-public/tesla/510.47.03/NVIDIA-Linux-x86_64-510.47.03.run' ]. I tested this once, but ran into an issue with the precompiled COS toolchain that gets downloaded. It may be possible to fix this, or the issue may not occur at all with a different COS version than mine. You could also try specifying --gcs-download-prefix for a different COS toolchain version and see if that works. I did not get a chance to confirm, as I timeboxed this driver upgrade to 2 hours.
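As a rough sketch of the direct-URL variant, the Python snippet below builds the installer URL and the matching command for a given driver version. The URL pattern is inferred from the single 510.47.03 example above and is an assumption for other versions; the function names are hypothetical, and, as noted, this approach was not confirmed to work.

```python
def nvidia_installer_url(driver_version):
    """Build the public installer URL, following the 510.47.03 example.

    Assumption: other versions under the tesla/ prefix use the same layout.
    """
    base = "https://storage.googleapis.com/nvidia-drivers-us-public/tesla"
    return f"{base}/{driver_version}/NVIDIA-Linux-x86_64-{driver_version}.run"


def build_url_command(driver_version):
    """Return the cos-gpu-installer argument list that points at that URL."""
    return [
        "/cos-gpu-installer",
        "install",
        "--allow-unsigned-driver",
        f"--nvidia-installer-url={nvidia_installer_url(driver_version)}",
    ]


if __name__ == "__main__":
    for arg in build_url_command("510.47.03"):
        print(arg)
```

Before relying on a generated URL, it is worth confirming that the file exists, for example by listing gs://nvidia-drivers-us-public/tesla/ with gsutil.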

DavraYoung commented 1 year ago

Where can we find information on how to control the driver and CUDA versions? This is really challenging, given the lack of information on deploying GPUs in GKE :(