NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu. #22

Closed: bleachzk closed this issue 6 years ago

bleachzk commented 6 years ago

I deployed the device-plugin container on k8s via the guide. But when I run tensorflow-notebook (by executing kubectl create -f tensorflow-notebook.yml), the pod is still pending:

[root@mlssdi010001 k8s]# kubectl describe pod tf-notebook-747db6987b-86zts
Name: tf-notebook-747db6987b-86zts
....
Events:
  Type     Reason             Age                From               Message
  ----     ------             ----               ----               -------
  Warning  FailedScheduling   47s (x15 over 3m)  default-scheduler  0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu.
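For context: the scheduler reports Insufficient nvidia.com/gpu when no node advertises that extended resource. The GPU request in tensorflow-notebook.yml looks roughly like the following (a sketch, not the exact manifest; the image tag is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
spec:
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:1.4.1-gpu   # illustrative image tag
    resources:
      limits:
        nvidia.com/gpu: 1   # only schedulable on nodes where the device plugin has registered GPUs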

Pod info:

[root@mlssdi010001 k8s]# kubectl get pod --all-namespaces -o wide
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE
default       tf-notebook-747db6987b-86zts           0/1     Pending   0          5s
....
kube-system   nvidia-device-plugin-daemonset-ljrwc   1/1     Running   0          34s   10.244.1.11   mlssdi010003
kube-system   nvidia-device-plugin-daemonset-m7h2r   1/1     Running   0          34s   10.244.2.12   mlssdi010002

Nodes info:

NAME           STATUS   ROLES    AGE   VERSION
mlssdi010001   Ready    master   1d    v1.9.0
mlssdi010002   Ready             1d    v1.9.0   (GPU node, 1 Tesla M40)
mlssdi010003   Ready             1d    v1.9.0   (GPU node, 1 Tesla M40)
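A quick way to check whether the plugin actually registered GPUs with the kubelet (if nvidia.com/gpu is absent from Capacity, the scheduler will always report Insufficient), for example on one of the GPU nodes:

$ kubectl describe node mlssdi010002 | grep -A 6 Capacity
# nvidia.com/gpu should appear under Capacity (and Allocatable) with a count of 1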

RenaudWasTaken commented 6 years ago

Hello!

Can you post the following:
- the description of your nodes (kubectl describe nodes)
- the logs of the k8s-device-plugin pods

bleachzk commented 6 years ago

description_of_nodes.txt
logs_of_the_k8s-device-plugin_pods.txt

bleachzk commented 6 years ago

After editing /etc/docker/daemon.json as follows:

{ "exec-opts": ["native.cgroupdriver=systemd"], "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } }

I get a new error:

[root@mlssdi010001 k8s]# kubectl logs nvidia-device-plugin-daemonset-qbdlh --namespace kube-system
2018/01/15 03:13:11 Loading NVML
2018/01/15 03:13:11 Fetching devices.
nvidia-device-plugin: symbol lookup error: nvidia-device-plugin: undefined symbol: nvmlDeviceGetPciInfo_v3

pineking commented 6 years ago

@bleachzk see this comment https://github.com/NVIDIA/k8s-device-plugin/issues/19#issuecomment-355724269

RenaudWasTaken commented 6 years ago

Can you run the following on your GPU node while the kubelet is running (i.e., the node is joined to the k8s cluster)?

$ docker build -t nvidia/k8s-device-plugin:1.9 https://github.com/NVIDIA/k8s-device-plugin.git#v1.9
$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9
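(Background: the plugin registers with the kubelet over a Unix socket in /var/lib/kubelet/device-plugins/, which is why that directory is bind-mounted into the container. A quick sanity check on the node, assuming the stock socket name:)

$ ls /var/lib/kubelet/device-plugins/
# kubelet.sock should always be present; nvidia.sock appears once the plugin has registered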
bleachzk commented 6 years ago

[root@mlssdi010003 k8s-device-plugin]# docker build -t nvidia/k8s-device-plugin:1.9 https://github.com/NVIDIA/k8s-device-plugin.git#v1.9
Sending build context to Docker daemon 7.56 MB
Step 1/17 : FROM nvidia/cuda:9.0-base-ubuntu16.04 as build
Error parsing reference: "nvidia/cuda:9.0-base-ubuntu16.04 as build" is not a valid repository/tag: invalid reference format

bleachzk commented 6 years ago

@pineking How can I build k8s-device-plugin without Docker? I tried to run:

C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
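(For context, the full out-of-Docker build sequence is roughly the following, assuming CUDA is installed under /usr/local/cuda and Go 1.9's GOPATH layout:)

$ go get -d github.com/NVIDIA/k8s-device-plugin
$ cd $GOPATH/src/github.com/NVIDIA/k8s-device-plugin && git checkout v1.9
$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
$ ./k8s-device-plugin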

I get the following error: error.log.txt

flx42 commented 6 years ago

What's your version of Go? And what's your version of Docker? I think you have old versions of both.

bleachzk commented 6 years ago

@flx42

[root@mlssdi010003 k8s-device-plugin]# go version
go version go1.9.2 linux/amd64

[root@mlssdi010003 k8s-device-plugin]# docker --version
Docker version 17.03.2-ce, build f5ec1e2

flx42 commented 6 years ago

OK, your version of Go is recent, so I'm not sure why the local build fails. Your version of Docker is a bit old, though: multi-stage builds require Docker 17.05 or newer, which is why docker build rejected the FROM ... as build line above.
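(For reference, a multi-stage Dockerfile follows this pattern; stages and paths here are illustrative, not the plugin's actual Dockerfile:)

# build stage: an image with the CUDA headers and NVML stubs
FROM nvidia/cuda:9.0-base-ubuntu16.04 as build
# ... install Go here and run: go build -o /nvidia-device-plugin ...

# final stage: copy only the compiled binary into a clean image
FROM nvidia/cuda:9.0-base-ubuntu16.04
COPY --from=build /nvidia-device-plugin /usr/bin/nvidia-device-plugin
CMD ["nvidia-device-plugin"]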

But still, the image provided on Docker Hub should work. What is the output of the following commands?

$ NVML_PATH=$(readlink --canonicalize $(ldconfig -p | awk '$1 == "libnvidia-ml.so.1" { print $4 }'))
$ echo $NVML_PATH
$ nm -D $NVML_PATH | grep nvmlDeviceGetPciInfo

Also, what's the version of your NVIDIA driver? (e.g. using nvidia-smi).

bleachzk commented 6 years ago

@flx42 Output of the commands:

[root@mlssdi010003 k8s-device-plugin]# echo $NVML_PATH
/usr/lib64/libnvidia-ml.so.375.26 /usr/lib/libnvidia-ml.so.375.26

[root@mlssdi010003 k8s-device-plugin]# nm -D $NVML_PATH | grep nvmlDeviceGetPciInfo
0000000000019090 T nvmlDeviceGetPciInfo
0000000000023b10 T nvmlDeviceGetPciInfo_v2
0001bcc0 T nvmlDeviceGetPciInfo
000267f0 T nvmlDeviceGetPciInfo_v2

CUDA Version: 8.0.61
NVIDIA-SMI 375.26    Driver Version: 375.26

flx42 commented 6 years ago

Oh, I see: your driver's libnvidia-ml.so (375.26) only exports nvmlDeviceGetPciInfo and nvmlDeviceGetPciInfo_v2, so a plugin compiled against the CUDA 9.0 stubs fails at load time looking for nvmlDeviceGetPciInfo_v3. We need to compile the image against the CUDA 8.0 stubs, not the CUDA 9.0 stubs. Will be easy to fix.
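(A minimal sketch of what the fix amounts to: link against the CUDA 8.0 NVML stub, which only references the _v2 symbol; the paths are illustrative:)

# build against the CUDA 8.0 stub libnvidia-ml.so instead of the 9.0 one
$ C_INCLUDE_PATH=/usr/local/cuda-8.0/include \
  LIBRARY_PATH=/usr/local/cuda-8.0/lib64/stubs \
  go build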

flx42 commented 6 years ago

@bleachzk Pull the latest docker image for the device plugin and try again.
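(On each GPU node, something like the following; the label selector assumes the stock daemonset manifest:)

$ docker pull nvidia/k8s-device-plugin:1.9
$ kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds
# the daemonset recreates the pods, now running the rebuilt image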

bleachzk commented 6 years ago

@flx42 Thanks~~~

dhli2 commented 5 years ago

I have the same problem. How should I solve it? Thanks.

RenaudWasTaken commented 5 years ago

Please open a new issue.