wanglingsong opened this issue 6 years ago
Can you try tweaking this env variable https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/daemonset.yaml#L56-L58
No, that's not what you need to do. That's for installing the drivers; @wanglingsong already seems to have the drivers installed.
You need to pass the -host-path flag when you start the device plugin:
./nvidia-gpu-device-plugin -host-path=/usr/lib/nvidia-384 -container-path=/usr/local/nvidia/lib64
For example, update https://github.com/kubernetes/kubernetes/blob/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml#L40 accordingly.
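A minimal sketch of how the relevant part of that daemonset's container spec might look after the change. The volume name is illustrative, and the hostPath shown is the standard kubelet device-plugin socket directory; check it against the linked manifest rather than taking this verbatim:

  containers:
  - name: nvidia-gpu-device-plugin
    # -host-path points at the driver libraries on the node;
    # -container-path is where containers will expect to find them
    command: ["/usr/bin/nvidia-gpu-device-plugin", "-logtostderr", "-host-path=/usr/lib/nvidia-384", "-container-path=/usr/local/nvidia/lib64"]
    volumeMounts:
    - name: device-plugin
      mountPath: /device-plugin
  volumes:
  - name: device-plugin
    hostPath:
      path: /var/lib/kubelet/device-plugins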
I tried the following command arguments. However, the plugin is still not up.
command: ["/usr/bin/nvidia-gpu-device-plugin", "-logtostderr", "-host-path=/usr/lib/nvidia-384", "-container-path=/usr/local/nvidia/lib64"]
kubectl get daemonset --all-namespaces:
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-system   kube-flannel-ds            1         1         1       1            1           beta.kubernetes.io/arch=amd64   1d
kube-system   kube-proxy                 1         1         1       1            1           <none>                          1d
kube-system   nvidia-gpu-device-plugin   0         0         0       0            0           <none>                          3m
Looks like you need to remove the node affinity, which is GCP-specific: https://github.com/kubernetes/kubernetes/blob/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml#L18-L24
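For reference, the GCP-specific part is the nodeAffinity rule keyed on a GKE label. A sketch of the block to delete, assuming the manifest matches the linked revision:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # this label only exists on GKE nodes, so on any other
          # cluster the daemonset schedules zero pods (DESIRED 0)
          - key: cloud.google.com/gke-accelerator
            operator: Exists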
The plugin came up after removing the node affinity, but I still can't run a CUDA application inside the container. I'm using Nvidia's cuda image.
switch@switch-PowerEdge-R730:~/esc/config/app$ sudo docker run --rm nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.
To use the device plugin, you need to start the container through Kubernetes rather than with plain docker run.
See the pod spec in https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#v18-onwards
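Roughly along the lines of the spec on that page, a minimal sketch (the pod name is illustrative):

  apiVersion: v1
  kind: Pod
  metadata:
    name: cuda-smi-test
  spec:
    restartPolicy: OnFailure
    containers:
    - name: cuda
      image: nvidia/cuda
      command: ["nvidia-smi"]
      resources:
        limits:
          # requesting the device plugin's resource name is what makes
          # the scheduler place the pod on a GPU node and mount the libs
          nvidia.com/gpu: 1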
I also tried starting it with Kubernetes. The logs showed the following error:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Okay, so it found the nvidia-smi binary but didn't find the libcuda.so.1 library. Can you check where libcuda.so.1 is present on the host? Is it not under /usr/lib/nvidia-384?
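A quick way to check, as a sketch:

  # is the library in the driver directory?
  ls -l /usr/lib/nvidia-384/libcuda.so*
  # does the dynamic linker know about it?
  ldconfig -p | grep libcuda
  # broader search if neither turns anything up
  find / -name 'libcuda.so*' 2>/dev/null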
A missing libcuda.so.1 might be due to ldconfig never having been run.
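If so, refreshing the linker cache on the host may create the missing libcuda.so.1 symlink; the directory argument here is assumed from this thread:

  # rebuilds the cache and version symlinks, also scanning the given directory
  sudo ldconfig /usr/lib/nvidia-384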
The Nvidia driver is indeed properly installed on my host; my TensorFlow container works fine using Nvidia's official device plugin.
I'm trying to install this plugin on my Kubernetes nodes, whose Nvidia driver is located at /usr/lib/nvidia-384.
I read the following instruction, but I don't know how to use it. Any example script or command?