PriyeshWani opened this issue 5 years ago
@PriyeshWani The nvidia driver version is tied to the GKE node version, as defined in this section: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers
It's not really tied; it's just the version that's pre-loaded in the image. You should be able to pick the version you want by setting it in the NVIDIA_DRIVER_VERSION env variable. I haven't tested with the daemonset above, but using the following: https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml
And then patching with:
kubectl patch daemonset -n kube-system nvidia-driver-installer --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"430.40"}]}]}}}}'
should do the job
Changing that variable won't work unless you can also convince someone to upload the matching version to https://storage.googleapis.com/nvidia-drivers-us-public/tesla/430.40/NVIDIA-Linux-x86_64-430.40-diagnostic.run. I just tried this trick to install 430.14 and ran into that. After it fails to download that package, it falls back to compiling the driver, but the container isn't set up with a functioning build toolchain, so the build fails. Or at least that's what I experienced today.
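A quick way to see whether the installer can even fetch a given version is to reconstruct the URL it downloads. This is a small sketch based on the bucket layout in the link above; the actual HEAD check is commented out since it needs network access:

```shell
# Build the download URL the installer tries for a given driver version.
# Bucket layout assumed from the example above (nvidia-drivers-us-public).
DRIVER_VERSION="430.40"
URL="https://storage.googleapis.com/nvidia-drivers-us-public/tesla/${DRIVER_VERSION}/NVIDIA-Linux-x86_64-${DRIVER_VERSION}-diagnostic.run"
echo "$URL"
# Uncomment to check availability (requires network access):
# curl -sfI "$URL" >/dev/null && echo "available" || echo "missing"
```

If the file 404s, the installer will fail exactly as described above.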
I don't see an easy way to override this behavior without rebuilding the image.
What are the timelines for updated drivers? As mentioned in the above comment, manually updating doesn't seem to be an option. Newer versions of PyTorch now require CUDA 10.2. This is becoming a problem.
Yes, it would be nice to at least have a workaround!
Looks like I have the same issue. Logs provided here: https://github.com/GoogleCloudPlatform/cos-gpu-installer/issues/52
What is the alternative workaround to install drivers with version > v440.30 on GKE?
You'll have to build your own daemonset driver installer. Once the driver is installed everything else should work correctly. I don't know of any community project that does this already.
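For anyone going that route, here is a rough sketch of the shape such a daemonset might take, modeled loosely on Google's installer manifests. The image name, accelerator value, and labels are placeholders you would replace with your own build:

```yaml
# Sketch only: the installer image is a placeholder for your own build.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: custom-nvidia-driver-installer
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: custom-nvidia-driver-installer
  template:
    metadata:
      labels:
        k8s-app: custom-nvidia-driver-installer
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-k80  # match your GPU type
      initContainers:
      - name: nvidia-driver-installer
        image: gcr.io/YOUR_PROJECT/your-driver-installer:latest  # placeholder
        securityContext:
          privileged: true
        env:
        - name: NVIDIA_DRIVER_VERSION
          value: "440.64.00"
      containers:
      - name: pause
        image: gcr.io/google-containers/pause:3.2
```

The init container does the actual install on the host and must run privileged; the pause container just keeps the pod alive so the daemonset reports ready.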
I have the same problem as @mikhno-s, but I'm a bit confused. The files do exist for other versions, as per @dwarburt's comment. I'm following the logs and I can't figure out why changing the driver version results in a permission-denied error.
The files do not exist, I believe. I'm not sure which version you're trying to install, but if it's a version that hasn't been uploaded to that bucket (nvidia-drivers-us-public), then the Google installer won't work and you'll need to make your own.
I'm seeing lots of files...
$ gsutil ls gs://nvidia-drivers-eu-public/tesla/
gs://nvidia-drivers-eu-public/tesla/384.183/
gs://nvidia-drivers-eu-public/tesla/390.116/
gs://nvidia-drivers-eu-public/tesla/396.26/
gs://nvidia-drivers-eu-public/tesla/396.37/
gs://nvidia-drivers-eu-public/tesla/396.44/
gs://nvidia-drivers-eu-public/tesla/396.82/
gs://nvidia-drivers-eu-public/tesla/410.104/
gs://nvidia-drivers-eu-public/tesla/410.72/
gs://nvidia-drivers-eu-public/tesla/410.79/
gs://nvidia-drivers-eu-public/tesla/418.126.02/
gs://nvidia-drivers-eu-public/tesla/418.152/
gs://nvidia-drivers-eu-public/tesla/418.40.04/
gs://nvidia-drivers-eu-public/tesla/418.67/
gs://nvidia-drivers-eu-public/tesla/418.87.00/
gs://nvidia-drivers-eu-public/tesla/418.87.01/
gs://nvidia-drivers-eu-public/tesla/440.64.00/
gs://nvidia-drivers-eu-public/tesla/440.95.01/
gs://nvidia-drivers-eu-public/tesla/450.51.05/
gs://nvidia-drivers-eu-public/tesla/450.51.06/
Same for the US. And looking inside the directories shows similar contents as 418.67. What am I missing?
Oof, just realized I'm looking in the wrong path. Nevermind...
Well, after snooping around in the public buckets I think I found one that works: 440.64.00. 🎉
@hsharrison were you able to install 440? If so, could you share the files :pray:
I didn't have to do anything fancy.
# Install the driver installer daemonset.
kubectl apply \
--filename https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
# Update the nvidia driver version.
kubectl patch daemonset nvidia-driver-installer \
--namespace kube-system \
--patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"440.64.00"}]}]}}}}'
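One easy mistake with these one-line patches is malformed JSON (an unbalanced brace or a missing closing quote). You can sanity-check the patch locally before handing it to kubectl, assuming python3 is available:

```shell
# Validate the patch JSON before applying it with kubectl.
PATCH='{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"440.64.00"}]}]}}}}'
echo "$PATCH" | python3 -m json.tool >/dev/null && echo "patch JSON is valid"
```

If the JSON is broken, json.tool prints the parse error and nothing reaches the cluster.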
@hsharrison Did it immediately work for you without any further actions?
Anyone found a working daemonset for GKE 1.16?
Have you tried using https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/cos/daemonset-nvidia-v450.yaml ? This daemonset does not require a preloaded image, and its version is 450.
@ruiwen-zhao Yes i patched my cluster with this command:
kubectl patch daemonset nvidia-driver-installer --namespace kube-system --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"450.51.06"}]}]}}}}'
Thing is, the driver installer pods are in CrashLoopBackOff. As GKE will force-upgrade us to 1.16, I am trying to find a fix on our staging environment. If I provide logs, would you be able to help me debug this?
Hey, yeah, if you can provide the logs from the installer, I will see what I can find.
Also, can you try just using the daemonset I provided, instead of patching the existing one? There are some other differences than driver version.
That daemonset worked for me on 1.16. Remind me to avoid GKE upgrades from now on, though. It was pretty late into the night before I figured out to try it.
With the new versions of the cos-gpu-installer, you have to rebuild the image and change the env file to install the version you want. The alternative is to use an old image from the Google registry: use gcr.io/cos-cloud/cos-gpu-installer:v20200701 and add an env variable NVIDIA_DRIVER_VERSION=430.40. To do so, you can either download this daemonset and edit it, or use commands like:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
kubectl patch daemonset nvidia-driver-installer --namespace kube-system --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","image":"gcr.io/cos-cloud/cos-gpu-installer:v20200701","imagePullPolicy":"IfNotPresent","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"430.40"}]}]}}}}'
It will work if the Nvidia driver version you are trying to install is available in your region. Check with the command in https://github.com/GoogleCloudPlatform/container-engine-accelerators/issues/119#issuecomment-668638011
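For reference, the public driver buckets seen in this thread follow a simple per-region naming pattern (us and eu at least; other regions may exist). This sketch just constructs the listing command, since actually running gsutil needs credentials and network access:

```shell
# Regional public driver buckets follow this naming pattern (us, eu seen above).
REGION="us"
CMD="gsutil ls gs://nvidia-drivers-${REGION}-public/tesla/"
echo "$CMD"
```

Run the printed command to see which driver versions are published for your region before patching the daemonset.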
How do I go about updating the nvidia driver version to 430.40? Right now, in my env, it is installing 410.x. Where is it getting 410 from?
From the script it seems the default is set to 418
I am using https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml to install daemonset on my nodes.
Thanks