NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu. #22

Closed: bleachzk closed this issue 6 years ago

bleachzk commented 6 years ago

I deployed the device-plugin container on k8s via the guide. But when I run tensorflow-notebook (by executing kubectl create -f tensorflow-notebook.yml), the pod is still pending:

[root@mlssdi010001 k8s]# kubectl describe pod tf-notebook-747db6987b-86zts
Name: tf-notebook-747db6987b-86zts
....
Events:
  Type     Reason             Age                From               Message
  ----     ------             ----               ----               -------
  Warning  FailedScheduling   47s (x15 over 3m)  default-scheduler  0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu.
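For context: the scheduler reports Insufficient nvidia.com/gpu when no node advertises that extended resource. The GPU request in tensorflow-notebook.yml looks roughly like the following (a sketch, not the exact manifest; the image tag is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
spec:
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:1.4.1-gpu   # illustrative image tag
    resources:
      limits:
        nvidia.com/gpu: 1   # only schedulable on nodes where the device plugin has registered GPUs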

Pod info:

[root@mlssdi010001 k8s]# kubectl get pod --all-namespaces -o wide
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE
default       tf-notebook-747db6987b-86zts           0/1     Pending   0          5s
....
kube-system   nvidia-device-plugin-daemonset-ljrwc   1/1     Running   0          34s   10.244.1.11   mlssdi010003
kube-system   nvidia-device-plugin-daemonset-m7h2r   1/1     Running   0          34s   10.244.2.12   mlssdi010002

Nodes info:

NAME           STATUS   ROLES    AGE   VERSION
mlssdi010001   Ready    master   1d    v1.9.0
mlssdi010002   Ready             1d    v1.9.0   (GPU node, 1 Tesla M40)
mlssdi010003   Ready             1d    v1.9.0   (GPU node, 1 Tesla M40)
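A quick way to check whether the plugin actually registered GPUs with the kubelet (if nvidia.com/gpu is absent from Capacity, the scheduler will always report Insufficient), for example on one of the GPU nodes:

$ kubectl describe node mlssdi010002 | grep -A 6 Capacity
# nvidia.com/gpu should appear under Capacity (and Allocatable) with a count of 1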

RenaudWasTaken commented 6 years ago

Hello!

Can you post the following:
- the description of your nodes (kubectl describe nodes)
- the logs of the k8s-device-plugin pods

bleachzk commented 6 years ago

description_of_nodes.txt
logs_of_the_k8s-device-plugin_pods.txt

bleachzk commented 6 years ago

After editing /etc/docker/daemon.json as follows:

{ "exec-opts": ["native.cgroupdriver=systemd"], "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } }

I get a new error:

[root@mlssdi010001 k8s]# kubectl logs nvidia-device-plugin-daemonset-qbdlh --namespace kube-system
2018/01/15 03:13:11 Loading NVML
2018/01/15 03:13:11 Fetching devices.
nvidia-device-plugin: symbol lookup error: nvidia-device-plugin: undefined symbol: nvmlDeviceGetPciInfo_v3

pineking commented 6 years ago

@bleachzk see this comment https://github.com/NVIDIA/k8s-device-plugin/issues/19#issuecomment-355724269

RenaudWasTaken commented 6 years ago

Can you run the following on your GPU node while the kubelet is running (i.e., the node is joined to the k8s cluster)?

$ docker build -t nvidia/k8s-device-plugin:1.9 https://github.com/NVIDIA/k8s-device-plugin.git#v1.9
$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9
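(Background: the plugin registers with the kubelet over a Unix socket in /var/lib/kubelet/device-plugins/, which is why that directory is bind-mounted into the container. A quick sanity check on the node, assuming the stock socket name:)

$ ls /var/lib/kubelet/device-plugins/
# kubelet.sock should always be present; nvidia.sock appears once the plugin has registered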
bleachzk commented 6 years ago

[root@mlssdi010003 k8s-device-plugin]# docker build -t nvidia/k8s-device-plugin:1.9 https://github.com/NVIDIA/k8s-device-plugin.git#v1.9
Sending build context to Docker daemon 7.56 MB
Step 1/17 : FROM nvidia/cuda:9.0-base-ubuntu16.04 as build
Error parsing reference: "nvidia/cuda:9.0-base-ubuntu16.04 as build" is not a valid repository/tag: invalid reference format

bleachzk commented 6 years ago

@pineking How can I build k8s-device-plugin without Docker? I tried to run:

C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
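(For context, the full out-of-Docker build sequence is roughly the following, assuming CUDA is installed under /usr/local/cuda and Go 1.9's GOPATH layout:)

$ go get -d github.com/NVIDIA/k8s-device-plugin
$ cd $GOPATH/src/github.com/NVIDIA/k8s-device-plugin && git checkout v1.9
$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
$ ./k8s-device-plugin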

I get the following error: error.log.txt

flx42 commented 6 years ago

What's your version of Go? And what's your version of Docker? I think you have old versions of both.

bleachzk commented 6 years ago

@flx42

[root@mlssdi010003 k8s-device-plugin]# go version
go version go1.9.2 linux/amd64

[root@mlssdi010003 k8s-device-plugin]# docker --version
Docker version 17.03.2-ce, build f5ec1e2

flx42 commented 6 years ago

OK, your version of Go is recent, so I'm not sure why the local build fails. Your version of Docker is a bit old, though: multi-stage builds require Docker 17.05 or newer, which is why docker build rejected the FROM ... as build line above.
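(For reference, a multi-stage Dockerfile follows this pattern; stages and paths here are illustrative, not the plugin's actual Dockerfile:)

# build stage: an image with the CUDA headers and NVML stubs
FROM nvidia/cuda:9.0-base-ubuntu16.04 as build
# ... install Go here and run: go build -o /nvidia-device-plugin ...

# final stage: copy only the compiled binary into a clean image
FROM nvidia/cuda:9.0-base-ubuntu16.04
COPY --from=build /nvidia-device-plugin /usr/bin/nvidia-device-plugin
CMD ["nvidia-device-plugin"]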

But still, the image provided on Docker Hub should work. What is the output of the following commands?

$ NVML_PATH=$(readlink --canonicalize $(ldconfig -p | awk '$1 == "libnvidia-ml.so.1" { print $4 }'))
$ echo $NVML_PATH
$ nm -D $NVML_PATH | grep nvmlDeviceGetPciInfo

Also, what's the version of your NVIDIA driver? (e.g. using nvidia-smi).

bleachzk commented 6 years ago

@flx42 Output of the commands:

[root@mlssdi010003 k8s-device-plugin]# echo $NVML_PATH
/usr/lib64/libnvidia-ml.so.375.26 /usr/lib/libnvidia-ml.so.375.26

[root@mlssdi010003 k8s-device-plugin]# nm -D $NVML_PATH | grep nvmlDeviceGetPciInfo
0000000000019090 T nvmlDeviceGetPciInfo
0000000000023b10 T nvmlDeviceGetPciInfo_v2
0001bcc0 T nvmlDeviceGetPciInfo
000267f0 T nvmlDeviceGetPciInfo_v2

CUDA Version: 8.0.61
NVIDIA-SMI 375.26    Driver Version: 375.26

flx42 commented 6 years ago

Oh, I see: your driver's libnvidia-ml.so (375.26) only exports nvmlDeviceGetPciInfo and nvmlDeviceGetPciInfo_v2, so a plugin compiled against the CUDA 9.0 stubs fails at load time looking for nvmlDeviceGetPciInfo_v3. We need to compile the image against the CUDA 8.0 stubs, not the CUDA 9.0 stubs. Will be easy to fix.
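(A minimal sketch of what the fix amounts to: link against the CUDA 8.0 NVML stub, which only references the _v2 symbol; the paths are illustrative:)

# build against the CUDA 8.0 stub libnvidia-ml.so instead of the 9.0 one
$ C_INCLUDE_PATH=/usr/local/cuda-8.0/include \
  LIBRARY_PATH=/usr/local/cuda-8.0/lib64/stubs \
  go build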

flx42 commented 6 years ago

@bleachzk Pull the latest docker image for the device plugin and try again.
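(On each GPU node, something like the following; the label selector assumes the stock daemonset manifest:)

$ docker pull nvidia/k8s-device-plugin:1.9
$ kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds
# the daemonset recreates the pods, now running the rebuilt image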

bleachzk commented 6 years ago

@flx42 Thanks~~~

dhli2 commented 5 years ago

I have the same problem. How should I solve it? Thanks.

RenaudWasTaken commented 5 years ago

Please open a new issue.