GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine

Nvidia driver failed while using cuda 10.0 in Kubernetes Cluster #133

Open limbuu opened 4 years ago

limbuu commented 4 years ago

We run Docker containers on GKE (Google Kubernetes Engine) 1.12.x with CUDA 10.0 and cuDNN > 7.6.5. The NVIDIA driver was installed as described in the docs, via https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml, which gives NVIDIA driver 410.79 on the cluster. But it looks like the CUDA packages install a different NVIDIA driver in the container, which mismatches the kernel module version. When running nvidia-smi in the container, we get:

Failed to initialize NVML: Driver/library version mismatch

How can we solve this issue?
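For reference, one rough way to confirm which driver versions are in play (a sketch only; exact paths depend on the node image):

# On the GKE node (e.g. via SSH), check the kernel-side driver version:
cat /proc/driver/nvidia/version

# Inside the GPU container, check which user-space driver library is visible:
nvidia-smi
ls /usr/local/nvidia/lib64 | grep libcuda

If the kernel module version on the node and the libcuda version seen by the container differ, nvidia-smi reports exactly this NVML mismatch error.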

theoallard commented 4 years ago

I have been getting similar errors. We also run jobs based on CUDA 10.0 Docker images from NVIDIA. Looking closer at the logs of the nvidia-driver-installer daemonset, it seems the driver and CUDA versions the daemonset installs changed between January 22nd and January 26th, from 410.79-cu10.0 to 418.67-cu10.1.
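The installed version can be read from the installer daemonset's logs, roughly like this (the label and container name are taken from daemonset-preloaded.yaml and may differ on other setups):

kubectl -n kube-system logs -l k8s-app=nvidia-driver-installer -c nvidia-driver-installer --tail=20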

theoallard commented 4 years ago

Actually, in our case this was because our node versions had been automatically upgraded. Turning off automatic upgrades and reverting the node version to 1.12.10-gke.17 fixed it.
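For anyone else hitting this, disabling auto-upgrade on the node pool looks roughly like this (POOL_NAME, CLUSTER_NAME, and ZONE are placeholders):

gcloud container node-pools update POOL_NAME \
  --cluster CLUSTER_NAME --zone ZONE \
  --no-enable-autoupgrade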

limbuu commented 4 years ago

@theoallard our node version is 1.12.10-gke.17. I was able to run nvidia-smi, and I fixed the environment variable paths too. [screenshot of nvidia-smi output]

Here is part of my Dockerfile:

# Add CUDA libraries (wget is used below, so install it here as well)
RUN apt-get update && apt-get install -y gnupg2 curl wget
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
RUN dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
RUN apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN apt-get update && apt-get install cuda-10-0 -y

# Install cuDNN libraries
ADD ./libcudnn7_7.6.5.32-1+cuda10.0_amd64.deb libcudnn7_7.6.5.32-1+cuda10.0_amd64.deb
RUN dpkg -i libcudnn7_7.6.5.32-1+cuda10.0_amd64.deb
ADD ./libcudnn7-dev_7.6.5.32-1+cuda10.0_amd64.deb libcudnn7-dev_7.6.5.32-1+cuda10.0_amd64.deb
RUN dpkg -i libcudnn7-dev_7.6.5.32-1+cuda10.0_amd64.deb

# Add TensorFlow
RUN pip install tensorflow-gpu==2.0.0

# Add environment variable paths
RUN whereis cuda-10.0
RUN whereis cuda
ENV PATH=/usr/local/cuda-10.0/bin:/usr/local/nvidia/bin${PATH:+:${PATH}}
ENV LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/nvidia/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

# Link cuDNN into the standard CUDA location
RUN ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda-10.0/lib64/
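After building the image, a quick sanity check from inside a running GPU pod looks roughly like this (POD_NAME is a placeholder; /usr/local/nvidia is where GKE mounts the host driver libraries, as far as I know):

# Confirm the host driver libraries are mounted into the container:
kubectl exec -it POD_NAME -- ls /usr/local/nvidia/lib64
# Confirm the user-space tools agree with the node's kernel module:
kubectl exec -it POD_NAME -- nvidia-smi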

Now I am getting an OOMKilling error while running my TensorFlow code.

[screenshot of the OOMKilling event]

This is weird, since I am using a single Tesla T4 with 16 GB of GPU memory for each container.

limbuu commented 4 years ago

Deleting the extra kernels and processes and running a single process at a time fixed the issue. I also found out that tensorflow-gpu==2.0.0 has a memory allocation issue: it grabs the maximum amount of GPU memory while loading variables with random initial values. I am thinking of switching to tensorflow-gpu==2.1.0.
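In the meantime, one workaround for the allocation behaviour, sketched under the assumption that the job is launched from a shell script (TF_FORCE_GPU_ALLOW_GROWTH is honored by TensorFlow 1.14+/2.x to my knowledge; the same effect can be achieved in code with tf.config.experimental.set_memory_growth):

# Ask TensorFlow to allocate GPU memory on demand instead of all up front;
# this can also be baked into the Dockerfile as an ENV line.
export TF_FORCE_GPU_ALLOW_GROWTH=true
python train.py   # train.py is a placeholder for the actual TensorFlow job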