Open tjliupeng opened 5 years ago
I guess it's because Nvidia's customized Kubernetes only has the v1alpha version of the device plugin API. That is Nvidia's own implementation, and it's not compatible with the Kubernetes community version.
I think it will work if you try the Kubernetes community version: https://github.com/kubernetes/kubernetes/blob/release-1.10/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto. I recommend using Kubernetes 1.11 or above.
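To illustrate why an API version mismatch could surface as "unknown service v1beta1.Registration" even though the socket is listening: a gRPC server dispatches each call by the fully qualified service name in the request path (`/package.Service/Method`). Below is a minimal sketch of that lookup using only the Go standard library. It is not the kubelet's or gRPC's actual code; `splitFullMethod` and `lookup` are hypothetical helpers, and the service name `deviceplugin.Registration` for the v1alpha API is an assumption based on that proto's package name.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFullMethod splits a gRPC full method name of the form
// "/package.Service/Method" into its service and method parts.
func splitFullMethod(full string) (service, method string, err error) {
	full = strings.TrimPrefix(full, "/")
	i := strings.LastIndex(full, "/")
	if i < 0 {
		return "", "", fmt.Errorf("malformed method name %q", full)
	}
	return full[:i], full[i+1:], nil
}

// lookup mimics, in simplified form, how a gRPC server resolves an
// incoming call against the set of services registered on it.
func lookup(registered map[string]bool, full string) error {
	service, _, err := splitFullMethod(full)
	if err != nil {
		return err
	}
	if !registered[service] {
		return fmt.Errorf("unknown service %s", service)
	}
	return nil
}

func main() {
	// Assumption: a kubelet that only serves the v1alpha device plugin
	// API registers its service under the "deviceplugin" proto package.
	alphaOnly := map[string]bool{"deviceplugin.Registration": true}

	// A plugin built against v1beta1 calls the v1beta1 service name.
	fmt.Println(lookup(alphaOnly, "/v1beta1.Registration/Register"))
	// A plugin built against v1alpha would match the registered name.
	fmt.Println(lookup(alphaOnly, "/deviceplugin.Registration/Register"))
}
```

If this is what is happening, the connection to the socket succeeds and only the service lookup fails, which would be consistent with the socket being in the LISTENING state.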
We have a K8s cluster running an Nvidia-customized version of Kubernetes for DGX, based on 1.10.8. We are trying to check whether the gpushare device plugin works on it. From the docker log, we find that the device plugin fails to register with the kubelet registration service through /var/lib/kubelet/device-plugins/kubelet.sock. I also checked this unix socket: it is opened by kubelet and is in the LISTENING state. What might cause the "unknown service v1beta1.Registration" error? Thx.