GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Apache License 2.0

How to use -host-path #49

Open wanglingsong opened 6 years ago

wanglingsong commented 6 years ago

I'm trying to install this plugin on my Kubernetes nodes, where the Nvidia driver is located at /usr/lib/nvidia-384.

I read the following instruction:

You can specify the directory on the host containing nvidia libraries using -host-path

But I don't know how to use it. Is there an example script or command?

cmluciano commented 6 years ago

Can you try tweaking this env variable https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/daemonset.yaml#L56-L58

rohitagarwal003 commented 6 years ago

Can you try tweaking this env variable https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/daemonset.yaml#L56-L58

No, that's not what you need to do. That's for installing the drivers. @wanglingsong already seems to have drivers installed.

You need to add the -host-path flag when you start the device plugin.

./nvidia-gpu-device-plugin -host-path=/usr/lib/nvidia-384 -container-path=/usr/local/nvidia/lib64

For example, update https://github.com/kubernetes/kubernetes/blob/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml#L40.
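
A rough sketch of what that container command might look like after the change (only the command line is edited, the rest of the manifest stays as shipped; the paths are this host's driver location):

    # Sketch: device plugin container with -host-path pointing at the node's driver directory.
    containers:
    - name: nvidia-gpu-device-plugin
      command:
      - /usr/bin/nvidia-gpu-device-plugin
      - -logtostderr
      - -host-path=/usr/lib/nvidia-384           # directory on the node holding the Nvidia libraries
      - -container-path=/usr/local/nvidia/lib64  # path at which GPU containers will see those libraries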

wanglingsong commented 6 years ago

I tried the following command arguments. However, the plugin is still not up.

        command: ["/usr/bin/nvidia-gpu-device-plugin", "-logtostderr", "-host-path=/usr/lib/nvidia-384", "-container-path=/usr/local/nvidia/lib64"]
kubectl get daemonset --all-namespaces:

NAMESPACE     NAME                       DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-system   kube-flannel-ds            1         1         1         1            1           beta.kubernetes.io/arch=amd64   1d
kube-system   kube-proxy                 1         1         1         1            1           <none>                          1d
kube-system   nvidia-gpu-device-plugin   0         0         0         0            0           <none>                          3m

rohitagarwal003 commented 6 years ago

Looks like you need to remove the node affinity which is GCP specific: https://github.com/kubernetes/kubernetes/blob/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml#L18-L24
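
For reference, that GCP-specific block looks roughly like the following (it keys off a label that only GKE nodes carry, which is why DESIRED stayed at 0 above); deleting it lets the DaemonSet schedule on any node:

    # Roughly the affinity block to remove; it matches a GKE-only node label.
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: cloud.google.com/gke-accelerator
              operator: Exists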

wanglingsong commented 6 years ago

The plugin came up after removing the node affinity, but I still can't run a CUDA application within a container. I'm using Nvidia's cuda image.

switch@switch-PowerEdge-R730:~/esc/config/app$ sudo docker run --rm nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.

rohitagarwal003 commented 6 years ago

To use the device plugin, you need to use Kubernetes to start the container.

See the pod spec in https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#v18-onwards
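
Something along these lines should exercise the plugin (a minimal sketch modeled on that page; the pod name is arbitrary and the image is the same nvidia/cuda image tried above):

    # Minimal test pod: requests one GPU via the device plugin and runs nvidia-smi.
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-smi-test          # arbitrary name for this sketch
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1      # the device plugin mounts the driver libraries into the container

If the plugin is wired up correctly, kubectl logs cuda-smi-test should show the usual nvidia-smi table.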

wanglingsong commented 6 years ago

I also tried starting it with Kubernetes. The logs showed the following error:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

rohitagarwal003 commented 6 years ago

Okay, so it found the nvidia-smi binary but didn't find the libcuda.so.1 library. Can you check where libcuda.so.1 is present on the host? Is it not under /usr/lib/nvidia-384?
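
For example, something like this on the node:

    # Look for the driver's userspace CUDA library on the host.
    ls -l /usr/lib/nvidia-384/libcuda.so*
    # Or search more broadly if it is not there:
    find / -name 'libcuda.so.1' 2>/dev/null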

therc commented 6 years ago

A missing libcuda.so.1 might be due to ldconfig never having been run.
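
If so, refreshing the linker cache for that directory on the host is a quick check (assuming the driver really is under /usr/lib/nvidia-384, per this thread):

    # Rebuild the dynamic linker cache, including the driver directory, then verify.
    sudo ldconfig /usr/lib/nvidia-384
    ldconfig -p | grep libcuda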

wanglingsong commented 6 years ago

The Nvidia driver is indeed properly installed on my host. My TensorFlow container works fine with Nvidia's official device plugin.