Closed bleachzk closed 6 years ago
Hello!
Can you post the logs of the following:
kubectl describe nodes
kubectl logs nvidia-device-plugin-daemonset-ljrwc --namespace kube-system
kubectl logs nvidia-device-plugin-daemonset-m7h2r --namespace kube-system
After editing /etc/docker/daemon.json as follows:
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
I get a new error:
[root@mlssdi010001 k8s]# kubectl logs nvidia-device-plugin-daemonset-qbdlh --namespace kube-system
2018/01/15 03:13:11 Loading NVML
2018/01/15 03:13:11 Fetching devices.
nvidia-device-plugin: symbol lookup error: nvidia-device-plugin: undefined symbol: nvmlDeviceGetPciInfo_v3
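Before digging into the plugin logs, it can help to confirm that the daemon.json edit actually sets the default runtime. The following is only a rough sketch: `default_runtime_is_nvidia` is a hypothetical helper name, and the check is a plain text match with grep rather than a real JSON parse, so it assumes the key and value are written on one line as in the config above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given daemon.json-style file sets
# "default-runtime" to "nvidia". This is a text match, not a JSON parser.
default_runtime_is_nvidia() {
    local file="$1"
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$file"
}

# Demo on a temporary copy of the config quoted in this thread, so the
# sketch does not touch the real /etc/docker/daemon.json.
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
{ "exec-opts": ["native.cgroupdriver=systemd"], "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } }
EOF

if default_runtime_is_nvidia "$tmp"; then
    echo "default runtime is nvidia"
else
    echo "default runtime is not nvidia"
fi
rm -f "$tmp"
```

On a real node the helper would be pointed at /etc/docker/daemon.json, and Docker needs a restart before a changed default runtime takes effect.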
@bleachzk see this comment https://github.com/NVIDIA/k8s-device-plugin/issues/19#issuecomment-355724269
Can you run the following on your GPU node while the kubelet is running (i.e., while the node is part of the k8s cluster)?
$ docker build -t nvidia/k8s-device-plugin:1.9 https://github.com/NVIDIA/k8s-device-plugin.git#v1.9
$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9
[root@mlssdi010003 k8s-device-plugin]# docker build -t nvidia/k8s-device-plugin:1.9 https://github.com/NVIDIA/k8s-device-plugin.git#v1.9
Sending build context to Docker daemon 7.56 MB
Step 1/17 : FROM nvidia/cuda:9.0-base-ubuntu16.04 as build
Error parsing reference: "nvidia/cuda:9.0-base-ubuntu16.04 as build" is not a valid repository/tag: invalid reference format
@pineking How can I build k8s-device-plugin without Docker? I tried running this command:
C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
I get the following error: error.log.txt
What's your version of Go? And what's your version of Docker? I think you have old versions of both.
@flx42
[root@mlssdi010003 k8s-device-plugin]# go version
go version go1.9.2 linux/amd64
[root@mlssdi010003 k8s-device-plugin]# docker --version
Docker version 17.03.2-ce, build f5ec1e2
OK, your version of Go is recent, so I'm not sure why it fails to build. Your version of Docker is a bit old, though: it doesn't support multi-stage builds.
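The "FROM ... as build" line that failed to parse is multi-stage build syntax, which Docker added in release 17.05. A minimal sketch of that version check follows; `supports_multistage` is a hypothetical helper name, not a Docker command, and it assumes a version string that starts with numeric major.minor components.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "yes" if the given Docker version string is
# at least 17.05 (the release that introduced multi-stage builds), else "no".
supports_multistage() {
    local version="$1"            # e.g. "17.03.2-ce"
    local major minor
    major="${version%%.*}"        # text before the first dot
    minor="${version#*.}"         # text after the first dot
    minor="${minor%%.*}"          # keep only the second component
    # Force base 10 so that "03" or "09" is not parsed as octal.
    major=$((10#$major))
    minor=$((10#$minor))
    if (( major > 17 || (major == 17 && minor >= 5) )); then
        echo yes
    else
        echo no
    fi
}

# The version reported in this thread:
supports_multistage "17.03.2-ce"   # prints "no"
```

On a live node the argument could come from `docker version --format '{{.Server.Version}}'` instead of a literal string.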
But still, the image provided on Docker Hub should work. What is the output of the following commands?
$ NVML_PATH=$(readlink --canonicalize $(ldconfig -p | awk '$1 == "libnvidia-ml.so.1" { print $4 }'))
$ echo $NVML_PATH
$ nm -D $NVML_PATH | grep nvmlDeviceGetPciInfo
Also, what's the version of your NVIDIA driver? (e.g. using nvidia-smi).
@flx42 Output of the commands:
[root@mlssdi010003 k8s-device-plugin]# echo $NVML_PATH
/usr/lib64/libnvidia-ml.so.375.26 /usr/lib/libnvidia-ml.so.375.26
[root@mlssdi010003 k8s-device-plugin]# nm -D $NVML_PATH | grep nvmlDeviceGetPciInfo
0000000000019090 T nvmlDeviceGetPciInfo
0000000000023b10 T nvmlDeviceGetPciInfo_v2
0001bcc0 T nvmlDeviceGetPciInfo
000267f0 T nvmlDeviceGetPciInfo_v2
CUDA Version : 8.0.61 NVIDIA-SMI 375.26 Driver Version: 375.26
Oh, I see, we need to compile the image against the CUDA 8.0 stubs, not the CUDA 9.0 stubs. Will be easy to fix.
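The mismatch can be checked mechanically: the plugin asks for nvmlDeviceGetPciInfo_v3, while the 375.26 library exports only the unversioned and _v2 symbols. The sketch below assumes `nm -D` style input on stdin; `has_symbol` is a hypothetical helper name, and the sample lines are copied from the output posted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `nm -D` output on stdin and succeeds only if
# a symbol with exactly the given name is listed.
has_symbol() {
    local name="$1"
    # nm prints "address type name"; compare the last field exactly so that
    # nvmlDeviceGetPciInfo_v2 does not satisfy a query for ..._v3.
    awk -v sym="$name" '$NF == sym { found = 1 } END { exit !found }'
}

# Sample input copied from the 375.26 driver output in this thread.
sample='0000000000019090 T nvmlDeviceGetPciInfo
0000000000023b10 T nvmlDeviceGetPciInfo_v2'

if printf '%s\n' "$sample" | has_symbol nvmlDeviceGetPciInfo_v3; then
    echo "v3 present"
else
    echo "v3 missing"   # this branch runs for the sample above
fi
```

Against a real library the input would be `nm -D /path/to/libnvidia-ml.so.1 | has_symbol nvmlDeviceGetPciInfo_v3`, run once per library path, since NVML_PATH in this thread resolved to two files.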
@bleachzk Pull the latest docker image for the device plugin and try again.
@flx42 Thanks~~~
I have the same problem. How should I solve it? Thanks.
Please open a new issue.
I deployed the device-plugin container on k8s following the guide, but when I run tensorflow-notebook (by executing kubectl create -f tensorflow-notebook.yml), the pod is still pending:
Pod info:
Nodes info: