AliyunContainerService / gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster
Apache License 2.0
468 stars 144 forks

Device plugin fails to register #6

Open tjliupeng opened 5 years ago

tjliupeng commented 5 years ago

We have a Kubernetes cluster running an NVIDIA-customized Kubernetes build for DGX, based on 1.10.8. We wanted to check whether the gpushare device plugin works on it. From the docker log, we found that the device plugin fails to register with the kubelet registration service through /var/lib/kubelet/device-plugins/kubelet.sock. I also checked this Unix socket: it is opened by kubelet and is in the LISTENING state. What might cause the "unknown service v1beta1.Registration" error? Thanks.

I0306 08:43:57.132717       1 main.go:18] Start gpushare device plugin
I0306 08:43:57.132779       1 gpumanager.go:28] Loading NVML
I0306 08:43:57.134391       1 gpumanager.go:37] Fetching devices.
I0306 08:43:57.134409       1 gpumanager.go:43] Starting FS watcher.
I0306 08:43:57.134475       1 gpumanager.go:51] Starting OS watcher.
I0306 08:43:57.141623       1 nvidia.go:64] Deivce GPU-95061e03-5740-5360-4968-f9c567395f4a's Path is /dev/nvidia0
I0306 08:43:57.141650       1 nvidia.go:69] # device Memory: 8116
I0306 08:43:57.141655       1 nvidia.go:40] set gpu memory: 7
I0306 08:43:57.141660       1 nvidia.go:76] # Add first device ID: GPU-95061e03-5740-5360-4968-f9c567395f4a-_-0
I0306 08:43:57.141665       1 nvidia.go:79] # Add last device ID: GPU-95061e03-5740-5360-4968-f9c567395f4a-_-6
I0306 08:43:57.141669       1 server.go:43] Device Map: map[GPU-95061e03-5740-5360-4968-f9c567395f4a:0]
I0306 08:43:57.141679       1 server.go:44] Device List: [GPU-95061e03-5740-5360-4968-f9c567395f4a]
I0306 08:43:57.159087       1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I0306 08:43:57.159595       1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I0306 08:43:57.160404       1 server.go:226] Could not register device plugin: rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
W0306 08:43:57.160522       1 gpumanager.go:66] Failed to start device plugin due to rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
I0306 08:43:57.161182       1 nvidia.go:64] Deivce GPU-95061e03-5740-5360-4968-f9c567395f4a's Path is /dev/nvidia0

cheyang commented 5 years ago

I guess it's because NVIDIA's customized Kubernetes only provides the v1alpha version of the device plugin API. That is NVIDIA's own implementation, and it's not compatible with the Kubernetes community version.

I think it will work if you use the Kubernetes community version, which provides the v1beta1 device plugin API: https://github.com/kubernetes/kubernetes/blob/release-1.10/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto. I recommend Kubernetes 1.11 or above.
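To make the mismatch easier to spot in logs, here is a rough Go sketch that pulls the requested gRPC service name out of an "unknown service" error line like the one above. The function names `requestedService` and `apiVersion` are made up for this example, and the regular expression only assumes the message format seen in this log:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Matches the gRPC error text seen in the plugin log above.
var unknownService = regexp.MustCompile(
	`code = Unimplemented desc = unknown service (\S+)`)

// requestedService returns the service name the server rejected, or
// ok=false if the line is not an "unknown service" error.
// (Hypothetical helper, for illustration only.)
func requestedService(logLine string) (service string, ok bool) {
	m := unknownService.FindStringSubmatch(logLine)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// apiVersion returns the package prefix of a gRPC service name,
// e.g. "v1beta1" for "v1beta1.Registration".
func apiVersion(service string) string {
	if i := strings.LastIndex(service, "."); i >= 0 {
		return service[:i]
	}
	return ""
}

func main() {
	line := "Could not register device plugin: rpc error: " +
		"code = Unimplemented desc = unknown service v1beta1.Registration"
	if svc, ok := requestedService(line); ok {
		fmt.Printf("kubelet does not serve %s (API version %s)\n",
			svc, apiVersion(svc))
	}
}
```

The `Unimplemented` code means the connection reached a gRPC server that does not know the service, which is consistent with the plugin asking for `v1beta1.Registration` from a kubelet that only serves an older API version.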