NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

GPU shows the wrong number after any node in the cluster restarts #277

Closed. summerisc closed this issue 2 years ago.

summerisc commented 2 years ago
  1. Issue or feature description
     I have a four-node cluster with two GPUs per node. After installing the plugin, each node reports the expected GPU count:

     [Screenshot: Screen Shot 2021-11-12 at 1 57 03 PM]

     This works fine, but after I reboot any one of the servers, for example 1-9, the plugin no longer reports the correct number of GPUs (I rebooted 1-9):

     [Screenshot: Screen Shot 2021-11-12 at 1 57 51 PM]

     Not only does 1-9 show 0 GPUs (even after the reboot, with 1-9 back online), 1-10 is also affected and shows 0 GPUs (I didn't touch that node). Any ideas why this happened? I'd appreciate any insights.

  2. Steps to reproduce the issue

    1. have a multi node GPU cluster
    2. reboot any one of the nodes and check the GPU count (a sample query is shown after the logs below)
    • [ ] The k8s-device-plugin container logs: [Screenshot: Screen Shot 2021-11-12 at 2 06 13 PM]
    • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
      11月 11 18:26:11 1-10 kubelet[2150]: I1111 18:26:11.992117    2150 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/nvidia-gpu.sock  <nil> 0 <nil>}] <nil> <nil>}
      11月 11 18:26:11 1-10 kubelet[2150]: I1111 18:26:11.991634    2150 manager.go:410] Got registration request from device plugin with resource name "nvidia.com/gpu"
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.607 [INFO][111135] k8s.go 476: Wrote updated endpoint to datastore ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Namespace="kube-system" Pod="nvidia-device-plugin-daemonset-tcfvw" WorkloadEndpoint="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0"
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.598 [INFO][111135] k8s.go 402: Added Mac, interface name, and active container ID to endpoint ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Namespace="kube-system" Pod="nvidia-device-plugin-daemonset-tcfvw" WorkloadEndpoint="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0", GenerateName:"nvidia-device-plugin-daemonset-", Namespace:"kube-system", SelfLink:"", UID:"3dd790a2-bd26-45e1-9ca4-b88802bd1df2", ResourceVersion:"48426783", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63772223162, loc:(*time.Location)(0x2b9b600)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"controller-revision-hash":"c4fb486c4", "name":"nvidia-device-plugin-ds", "pod-template-generation":"1", "projectcalico.org/namespace":"kube-system", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"default"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"1-10", ContainerID:"ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3", Pod:nvidia-device-plugin-daemonset-tcfvw", Endpoint:"eth0", ServiceAccountName:"default", IPNetworks:[]string{"10.244.133.29/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.kube-system", "ksa.kube-system.default"}, InterfaceName:"calie1b220cea47", MAC:"d6:e7:02:a7:f8:db", Ports:[]v3.EndpointPort(nil)}}
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.547 [INFO][111135] dataplane_linux.go 420: Disabling IPv4 forwarding ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Namespace="kube-system" Pod="nvidia-device-plugin-daemonset-tcfvw" WorkloadEndpoint="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0"
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.546 [INFO][111135] dataplane_linux.go 66: Setting the host side veth name to calie1b220cea47 ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Namespace="kube-system" Pod="nvidia-device-plugin-daemonset-tcfvw" WorkloadEndpoint="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0"
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.546 [INFO][111135] k8s.go 375: Calico CNI using IPs: [10.244.133.29/32] ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Namespace="kube-system" Pod="nvidia-device-plugin-daemonset-tcfvw" WorkloadEndpoint="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0"
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.546 [INFO][111135] k8s.go 374: Populated endpoint ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Namespace="kube-system" Pod="nvidia-device-plugin-daemonset-tcfvw" WorkloadEndpoint="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0", GenerateName:"nvidia-device-plugin-daemonset-", Namespace:"kube-system", SelfLink:"", UID:"3dd790a2-bd26-45e1-9ca4-b88802bd1df2", ResourceVersion:"48426783", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63772223162, loc:(*time.Location)(0x2b9b600)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"controller-revision-hash":"c4fb486c4", "name":"nvidia-device-plugin-ds", "pod-template-generation":"1", "projectcalico.org/namespace":"kube-system", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"default"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"1-10", ContainerID:"", Pod:"nvidia-device-plugin-daemonset-tcfvw", Endpoint:"eth0", ServiceAccountName:"default", IPNetworks:[]string{"10.244.133.29/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.kube-system", "ksa.kube-system.default"}, InterfaceName:"calie1b220cea47", MAC:"", Ports:[]v3.EndpointPort(nil)}}
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.539 [INFO][111155] ipam_plugin.go 276: Calico CNI IPAM assigned addresses IPv4=[10.244.133.29/26] IPv6=[] ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" HandleID="k8s-pod-network.ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Workload="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0"
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.490 [INFO][111155] ipam_plugin.go 265: Auto assigning IP ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" HandleID="k8s-pod-network.ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Workload="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0" assignArgs=ipam.AutoAssignArgs{Num4:1, Num6:0, HandleID:(*string)(0xc0003c5360), Attrs:map[string]string{"namespace":"kube-system", "node":"1-10", "pod":"nvidia-device-plugin-daemonset-tcfvw", "timestamp":"2021-11-11 10:26:03.47315434 +0000 UTC"}, Hostname:"1-10", IPv4Pools:[]net.IPNet{}, IPv6Pools:[]net.IPNet{}, MaxBlocksPerHost:0, HostReservedAttrIPv4s:(*ipam.HostReservedAttr)(nil), HostReservedAttrIPv6s:(*ipam.HostReservedAttr)(nil)}
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.473 [INFO][111155] ipam_plugin.go 226: Calico CNI IPAM request count IPv4=1 IPv6=0 ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" HandleID="k8s-pod-network.ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Workload="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0"
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.424 [INFO][111135] k8s.go 71: Extracted identifiers for CmdAddK8s ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Namespace="kube-system" Pod="nvidia-device-plugin-daemonset-tcfvw" WorkloadEndpoint="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0"
      11月 11 18:26:03 1-10 kubelet[2150]: 2021-11-11 18:26:03.424 [INFO][111135] plugin.go 260: Calico CNI found existing endpoint: &{{WorkloadEndpoint projectcalico.org/v3} {1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-eth0 nvidia-device-plugin-daemonset- kube-system  3dd790a2-bd26-45e1-9ca4-b88802bd1df2 48426783 0 2021-11-11 18:26:02 +0800 CST <nil> <nil> map[controller-revision-hash:c4fb486c4 name:nvidia-device-plugin-ds pod-template-generation:1 projectcalico.org/namespace:kube-system projectcalico.org/orchestrator:k8s projectcalico.org/serviceaccount:default] map[] [] []  []} {k8s  1-10  nvidia-device-plugin-daemonset-tcfvw eth0 default [] []   [kns.kube-system ksa.kube-system.default] calie1b220cea47  []}} ContainerID="ff2de5c40582d2dccef771c3cdbc24fd7d5143b92e16bb77c00aa96a66da6aa3" Namespace="kube-system" Pod="nvidia-device-plugin-daemonset-tcfvw" WorkloadEndpoint="1--10-k8s-nvidia--device--plugin--daemonset--tcfvw-"
      11月 11 18:26:02 1-10 kubelet[2150]: I1111 18:26:02.705212    2150 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-nxqmk" (UniqueName: "kubernetes.io/secret/3dd790a2-bd26-45e1-9ca4-b88802bd1df2-default-token-nxqmk") pod "nvidia-device-plugin-daemonset-tcfvw" (UID: "3dd790a2-bd26-45e1-9ca4-b88802bd1df2")
      11月 11 18:26:02 1-10 kubelet[2150]: I1111 18:26:02.705156    2150 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "device-plugin" (UniqueName: "kubernetes.io/host-path/3dd790a2-bd26-45e1-9ca4-b88802bd1df2-device-plugin") pod "nvidia-device-plugin-daemonset-tcfvw" (UID: "3dd790a2-bd26-45e1-9ca4-b88802bd1df2")
      11月 11 18:20:12 1-10 kubelet[2150]: W1111 18:20:12.849402    2150 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "nvidia-device-plugin-daemonset-md2vp_kube-system": unexpected command output Device "eth0" does not exist.
      11月 11 18:20:12 1-10 kubelet[2150]: 2021-11-11 18:20:12.833 [INFO][91040] ipam.go 1410: Releasing all IPs with handle 'kube-system.nvidia-device-plugin-daemonset-md2vp'

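For step 2 above, the per-node GPU count can be checked with a query along these lines (illustrative; any equivalent check of the node's nvidia.com/gpu capacity works):

    # show the nvidia.com/gpu capacity advertised by each node
    kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'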

elezar commented 2 years ago

Hi @summerisc, which version of the device plugin is this, and what does nvidia-smi show on one of the affected hosts?
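For reference, the image tag of the running plugin can be read from the daemonset (assuming the default nvidia-device-plugin-daemonset name seen in the logs above):

    # print the device-plugin container image, which carries the version tag
    kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset \
      -o jsonpath='{.spec.template.spec.containers[0].image}'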

summerisc commented 2 years ago

I am using the nvcr.io/nvidia/k8s-device-plugin:v0.9.0 image. nvidia-smi result:

[Screenshot: Screen Shot 2021-11-12 at 5 27 23 PM]

The CUDA/driver versions and GPU models are the same across the nodes in the cluster.
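One quick way to confirm that across nodes (illustrative) is to run the same query on each host and compare the output:

    # report the GPU model and driver version on this host
    nvidia-smi --query-gpu=name,driver_version --format=csv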