GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Apache License 2.0

Is it okay to change the NVIDIA driver version from the one set by the DaemonSet in the GPU for Kubernetes cluster docs? #134

Open limbuu opened 4 years ago

limbuu commented 4 years ago

We are currently using the NVIDIA driver DaemonSet from https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml. Unfortunately, it installs NVIDIA driver 410.79, which is compatible with Kubernetes 1.12.x. We want to upgrade the driver to 440.x and also upgrade our CUDA libraries, which are currently at CUDA 10.0. Our GPU hardware is the Tesla T4 with Tensor Cores. We are already in production, but some of our TensorFlow code does not run there. We use tensorflow-gpu=2.0.0. Most surprisingly, the same code runs perfectly on a local machine with GPU support (NVIDIA GTX 1060 and 1080), so it has been hard to track down what is causing the problem.

allanlei commented 4 years ago

We ran into the same requirement while working with ffmpeg: the scale_npp filter produced corrupt video on a T4 with driver 410.79. To change the driver version:

  1. Run gsutil ls gs://nvidia-drivers-asia-public/tesla (replace asia with your GCP region)
  2. Find the driver version you want (e.g. 440.64.00)
  3. Set the following in the daemonset:
     env:
     - name: NVIDIA_DRIVER_VERSION
       value: "440.64.00"
  4. Re-install the daemonset
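
The steps above can be sketched as two small shell helpers. This is only an illustration, not something taken from the repo: the function names `driver_bucket` and `set_driver_version` are hypothetical, the bucket path simply follows the pattern in step 1, and `set_driver_version` assumes the manifest already contains a `NVIDIA_DRIVER_VERSION` env entry with its `value:` on the next line. If your manifest has no such entry, add it by hand as in step 3.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of any Google tooling.

# Print the public driver bucket path for a region prefix (e.g. "asia"),
# following the pattern shown in step 1.
driver_bucket() {
  local region="$1"
  printf 'gs://nvidia-drivers-%s-public/tesla\n' "$region"
}

# Read a manifest on stdin and rewrite the "value:" line that directly
# follows "- name: NVIDIA_DRIVER_VERSION". Assumes that entry already exists;
# a manifest without it is passed through unchanged.
set_driver_version() {
  local version="$1"
  awk -v ver="$version" '
    found && /^[ \t]*value:/ { sub(/value:.*/, "value: \"" ver "\""); found = 0 }
    /name:[ \t]*NVIDIA_DRIVER_VERSION/ { found = 1 }
    { print }
  '
}
```

With these loaded, `gsutil ls "$(driver_bucket asia)"` lists the available drivers, and `set_driver_version 440.64.00 < daemonset-preloaded.yaml | kubectl apply -f -` would re-apply the edited manifest. Review the rewritten YAML before applying it to a production cluster.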

FYI, looking at https://github.com/GoogleCloudPlatform/cos-gpu-installer, the image was recently updated to use 440.64.00, so this is mostly good to know for the future, when newer drivers come out.