GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine

Nvidia driver failed while using cuda 10.0 in Kubernetes Cluster #133

Open limbuu opened 4 years ago

limbuu commented 4 years ago

We run Docker containers on GKE (Google Kubernetes Engine) 1.12.x with CUDA 10.0 and cuDNN > 7.6.5. The NVIDIA driver was installed as described in the docs, via https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml, which gives NVIDIA driver 410.79 on the cluster. But it looks like the CUDA packages install a different NVIDIA driver in the container, which mismatches the kernel module version. When running nvidia-smi in the container, we get:

Failed to initialize NVML: Driver/library version mismatch

How can we solve this issue?
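For reference, one rough way to confirm which driver versions are in play (a sketch only; exact paths depend on the node image):

# On the GKE node (e.g. via SSH), check the kernel-side driver version:
cat /proc/driver/nvidia/version

# Inside the GPU container, check which user-space driver library is visible:
nvidia-smi
ls /usr/local/nvidia/lib64 | grep libcuda

If the kernel module version on the node and the libcuda version seen by the container differ, nvidia-smi reports exactly this NVML mismatch error.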

theoallard commented 4 years ago

I have been getting similar errors. We also run jobs based on CUDA 10.0 Docker images from NVIDIA. Looking closer at the logs of the nvidia-driver-installer daemonset, it seems the driver and CUDA versions the daemonset installs changed between January 22nd and January 26th, from 410.79-cu10.0 to 418.67-cu10.1.
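The installed version can be read from the installer daemonset's logs, roughly like this (the label and container name are taken from daemonset-preloaded.yaml and may differ on other setups):

kubectl -n kube-system logs -l k8s-app=nvidia-driver-installer -c nvidia-driver-installer --tail=20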

theoallard commented 4 years ago

Actually, in our case this was because our node versions had been automatically upgraded. Turning off automatic upgrades and reverting the node version to 1.12.10-gke.17 fixed it.
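For anyone else hitting this, disabling auto-upgrade on the node pool looks roughly like this (POOL_NAME, CLUSTER_NAME, and ZONE are placeholders):

gcloud container node-pools update POOL_NAME \
  --cluster CLUSTER_NAME --zone ZONE \
  --no-enable-autoupgrade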

limbuu commented 4 years ago

@theoallard our node version is 1.12.10-gke.17. I was able to run nvidia-smi, and I fixed the environment variable paths too. [screenshot of nvidia-smi output]

Here is part of my Dockerfile:

# Add CUDA libraries (wget is used below, so install it here as well)
RUN apt-get update && apt-get install -y gnupg2 curl wget
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
RUN dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
RUN apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN apt-get update && apt-get install cuda-10-0 -y

# Install cuDNN libraries
ADD ./libcudnn7_7.6.5.32-1+cuda10.0_amd64.deb libcudnn7_7.6.5.32-1+cuda10.0_amd64.deb
RUN dpkg -i libcudnn7_7.6.5.32-1+cuda10.0_amd64.deb
ADD ./libcudnn7-dev_7.6.5.32-1+cuda10.0_amd64.deb libcudnn7-dev_7.6.5.32-1+cuda10.0_amd64.deb
RUN dpkg -i libcudnn7-dev_7.6.5.32-1+cuda10.0_amd64.deb

# Add TensorFlow
RUN pip install tensorflow-gpu==2.0.0

# Add environment variable paths
RUN whereis cuda-10.0
RUN whereis cuda
ENV PATH=/usr/local/cuda-10.0/bin:/usr/local/nvidia/bin${PATH:+:${PATH}}
ENV LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/nvidia/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

# Link cuDNN into the standard CUDA location
RUN ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda-10.0/lib64/
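After building the image, a quick sanity check from inside a running GPU pod looks roughly like this (POD_NAME is a placeholder; /usr/local/nvidia is where GKE mounts the host driver libraries, as far as I know):

# Confirm the host driver libraries are mounted into the container:
kubectl exec -it POD_NAME -- ls /usr/local/nvidia/lib64
# Confirm the user-space tools agree with the node's kernel module:
kubectl exec -it POD_NAME -- nvidia-smi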

Now I am getting an OOMKilling error while running my TensorFlow code.

[screenshot of the OOMKilling event]

This is weird, since I am using a single Tesla T4 with 16 GB of GPU memory for each container.

limbuu commented 4 years ago

Deleting the extra kernels and processes and running a single process at a time fixed the issue. I also found out that tensorflow-gpu==2.0.0 has a memory allocation issue: it grabs the maximum amount of GPU memory while loading variables with random initial values. I am thinking of switching to tensorflow-gpu==2.1.0.
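In the meantime, one workaround for the allocation behaviour, sketched under the assumption that the job is launched from a shell script (TF_FORCE_GPU_ALLOW_GROWTH is honored by TensorFlow 1.14+/2.x to my knowledge; the same effect can be achieved in code with tf.config.experimental.set_memory_growth):

# Ask TensorFlow to allocate GPU memory on demand instead of all up front;
# this can also be baked into the Dockerfile as an ENV line.
export TF_FORCE_GPU_ALLOW_GROWTH=true
python train.py   # train.py is a placeholder for the actual TensorFlow job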