GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Apache License 2.0

Installing the daemonset with nvidia-driver version 430.40 #119

Open PriyeshWani opened 5 years ago

PriyeshWani commented 5 years ago

How do I go about updating the nvidia driver version to 430.40? Right now, in my env, it is setting up 410.x. Where is it getting 410 from?

From the script, it seems the default is set to 418.

I am using https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml to install daemonset on my nodes.

Thanks

chardch commented 4 years ago

@PriyeshWani The nvidia driver version is tied to the GKE node version, as defined in this section: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers

lutierigb commented 4 years ago

It's not really tied; it's just the version that's pre-loaded in the image. You should be able to pick the version you want by setting it in the NVIDIA_DRIVER_VERSION env variable. I haven't tested it with the daemonset above, but using the following: https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml

And then patching with:

kubectl patch daemonset -n kube-system nvidia-driver-installer --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"430.40"}]}]}}}}'

That should do the job.

dwarburt commented 4 years ago

Changing that variable won't work unless you can also convince someone to upload the matching version to https://storage.googleapis.com/nvidia-drivers-us-public/tesla/430.40/NVIDIA-Linux-x86_64-430.40-diagnostic.run. I just tried this trick to install 430.14 and ran into that. After it fails to download that package, it falls back to compiling the driver, but the container is not set up with a functioning build toolchain, so the build fails. Or at least that's what I've experienced today.

https://github.com/GoogleCloudPlatform/cos-gpu-installer/blob/master/cos-gpu-installer-docker/gpu_installer_url_lib.sh#L66

I don't see an easy way to override this behavior without rebuilding the image.
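
For what it's worth, a quick way to check whether a given driver version has actually been uploaded is to probe the URL the installer would download (a rough sketch, assuming the us-public bucket and the -diagnostic.run naming from the link above):

# HEAD request against the download URL; a 404 in the status line means the
# version is not in the bucket, and the installer falls back to the source
# build path described above.
curl -sI https://storage.googleapis.com/nvidia-drivers-us-public/tesla/430.40/NVIDIA-Linux-x86_64-430.40-diagnostic.run | head -n 1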

viveklak commented 4 years ago

> @PriyeshWani The nvidia driver version is tied to the GKE node version, as defined in this section: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers

What are the timelines for updated drivers? As mentioned in the above comment, manually updating doesn't seem to be an option. Newer versions of PyTorch now require CUDA 10.2, so this is becoming a problem.

pievalentin commented 4 years ago

Yes, it would be nice to at least have a workaround!

mikhno-s commented 4 years ago

Looks like I have the same issue. I provided logs here: https://github.com/GoogleCloudPlatform/cos-gpu-installer/issues/52

What is the workaround to install drivers with a version > 440.30 on GKE?

dwarburt commented 4 years ago

You'll have to build your own daemonset driver installer. Once the driver is installed, everything else should work correctly. I don't know of any community project that does this already.

hsharrison commented 4 years ago

I have the same problem as @mikhno-s but I'm a bit confused. The files do exist for other versions, as per @dwarburt's comment. I'm following the logs, and I can't figure out why changing the driver version results in a permission-denied error.

dwarburt commented 4 years ago

> I have the same problem as @mikhno-s but I'm a bit confused. The files do exist for other versions, as per @dwarburt's comment. I'm following the logs, and I can't figure out why changing the driver version results in a permission-denied error.

The files do not exist, I believe. I'm not sure what version you're trying to install, but if it's a version that hasn't been uploaded to that bucket (nvidia-drivers-us-public), then the Google installer won't work and you'll need to make your own.

hsharrison commented 4 years ago

I'm seeing lots of files...

$ gsutil ls gs://nvidia-drivers-eu-public/tesla/
gs://nvidia-drivers-eu-public/tesla/384.183/
gs://nvidia-drivers-eu-public/tesla/390.116/
gs://nvidia-drivers-eu-public/tesla/396.26/
gs://nvidia-drivers-eu-public/tesla/396.37/
gs://nvidia-drivers-eu-public/tesla/396.44/
gs://nvidia-drivers-eu-public/tesla/396.82/
gs://nvidia-drivers-eu-public/tesla/410.104/
gs://nvidia-drivers-eu-public/tesla/410.72/
gs://nvidia-drivers-eu-public/tesla/410.79/
gs://nvidia-drivers-eu-public/tesla/418.126.02/
gs://nvidia-drivers-eu-public/tesla/418.152/
gs://nvidia-drivers-eu-public/tesla/418.40.04/
gs://nvidia-drivers-eu-public/tesla/418.67/
gs://nvidia-drivers-eu-public/tesla/418.87.00/
gs://nvidia-drivers-eu-public/tesla/418.87.01/
gs://nvidia-drivers-eu-public/tesla/440.64.00/
gs://nvidia-drivers-eu-public/tesla/440.95.01/
gs://nvidia-drivers-eu-public/tesla/450.51.05/
gs://nvidia-drivers-eu-public/tesla/450.51.06/

Same for the US. And looking inside the directories shows contents similar to 418.67's. What am I missing?

hsharrison commented 4 years ago

Oof, just realized I'm looking in the wrong path. Nevermind...

hsharrison commented 4 years ago

Well, after snooping around in the public buckets I think I found one that works: 440.64.00. 🎉

pasikon commented 4 years ago

@hsharrison were you able to install the 440 driver? If so, could you provide some files :pray:

hsharrison commented 4 years ago

I didn't have to do anything fancy.

# Install driver installer daemonset.
kubectl apply \
  --filename https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
# Update nvidia driver version.
kubectl patch daemonset nvidia-driver-installer \
  --namespace kube-system \
  --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"440.64.00"}]}]}}}}'
hilsenrat commented 4 years ago

@hsharrison Did it immediately work for you without any further actions?

pievalentin commented 4 years ago

Anyone found a working daemonset for GKE 1.16?

ruiwen-zhao commented 4 years ago

> Anyone found a working daemonset for GKE 1.16?

Have you tried using https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/cos/daemonset-nvidia-v450.yaml ? This daemonset does not require a preloaded image, and its version is 450.
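
For reference, applying it directly should look something like this (raw URL derived from the blob link above):

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-nvidia-v450.yaml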

pievalentin commented 4 years ago

@ruiwen-zhao Yes, I patched my cluster with this command:

kubectl patch daemonset nvidia-driver-installer --namespace kube-system --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"450.51.06"}]}]}}}}'

The thing is, the drivers are in a CrashLoopBackOff. As GKE will force-upgrade us to 1.16, I am trying to find a fix on our staging environment. If I provide logs, would you be able to help me debug this?

ruiwen-zhao commented 4 years ago

> @ruiwen-zhao Yes, I patched my cluster with this command:
>
> kubectl patch daemonset nvidia-driver-installer --namespace kube-system --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"450.51.06"}]}]}}}}'
>
> The thing is, the drivers are in a CrashLoopBackOff. As GKE will force-upgrade us to 1.16, I am trying to find a fix on our staging environment. If I provide logs, would you be able to help me debug this?

Hey, yeah, if you can provide the logs from the installer I will see what I can find.

Also, can you try just using the daemonset I provided, instead of patching the existing one? There are other differences besides the driver version.
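
Something along these lines should capture the installer output (a sketch; the k8s-app=nvidia-driver-installer label and the init container name are taken from the manifests above, and <pod-name> is whatever the first command returns):

kubectl get pods --namespace kube-system -l k8s-app=nvidia-driver-installer
kubectl logs --namespace kube-system <pod-name> -c nvidia-driver-installer --previous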

hsharrison commented 4 years ago

That daemonset worked for me on 1.16. Remind me to avoid GKE upgrades from now on, though; it was pretty late into the night before I figured out to try it.

NathanGuyot commented 2 years ago

With the new versions of the cos-gpu-installer, you have to rebuild the image and change the env file to install the version that you want. The alternative is to use an old image from the Google registry: you can use gcr.io/cos-cloud/cos-gpu-installer:v20200701 and add an env variable NVIDIA_DRIVER_VERSION=430.40. To do so, you can either download this daemonset and edit it, or use a command like this:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

kubectl patch daemonset nvidia-driver-installer --namespace kube-system --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","image":"gcr.io/cos-cloud/cos-gpu-installer:v20200701","imagePullPolicy":"IfNotPresent","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"430.40"}]}]}}}}'

It will work if the NVIDIA driver version you are trying to install is available in your region. Check with the command shown in https://github.com/GoogleCloudPlatform/container-engine-accelerators/issues/119#issuecomment-668638011.
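
For convenience, that availability check boils down to listing the regional public bucket (us shown here; the eu bucket from the earlier comment works the same way):

gsutil ls gs://nvidia-drivers-us-public/tesla/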