Open yingjunxu opened 3 years ago
[foot@vz01sjzn06 ~]$ kubectl describe pod gpushare-device-plugin-ds-x525n -n kube-system
Name: gpushare-device-plugin-ds-x525n
Namespace: kube-system
Node: 10.78.189.132/10.78.189.132
Start Time: Fri, 16 Jul 2021 14:19:18 +0800
Labels: app=gpushare
component=gpushare-device-plugin
controller-revision-hash=7d7c7b4666
name=gpushare-device-plugin-ds
pod-template-generation=5
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Running
IP: 10.78.189.132
Controlled By: DaemonSet/gpushare-device-plugin-ds
Containers:
gpushare:
Container ID: docker://31022baa09b663cd5400c6683d29ec42fe919851bcb1f82a345069cbcd430884
Image: k8s-gpushare-plugin:1.11
Image ID: docker://sha256:e424cbbbfc1b02f3f4f4da715efdf1c1ced9034bfd049327bc004824acd37d39
Port:
gpushare-device-plugin-token-xs4m4:
Type: Secret (a volume populated by a Secret)
SecretName: gpushare-device-plugin-token-xs4m4
Optional: false
QoS Class: Guaranteed
Node-Selectors: gpushare=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message
Normal Scheduled 30m default-scheduler Successfully assigned kube-system/gpushare-device-plugin-ds-x525n to 10.78.189.132
Normal Pulled 30m kubelet, 10.78.189.132 Container image "k8s-gpushare-plugin:1.11" already present on machine
Normal Created 30m kubelet, 10.78.189.132 Created container gpushare
Normal Started 30m kubelet, 10.78.189.132 Started container gpushare
Sorry for the late reply. If you still have this problem, please update the device plugin's startup command parameter in the DaemonSet YAML to --memory-unit=GiB. Setting it to MiB means the device plugin reports each MiB of GPU memory as one virtual device to the kubelet, so on your GPU node it reports 22918 * 4 virtual devices. Transferring device info for that many devices is very likely to exceed the gRPC message size limit.
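For reference, a sketch of what that change looks like in the device plugin DaemonSet spec. The container name and image are taken from the pod description above; the binary name is a placeholder and may differ in your manifest. Only --memory-unit=GiB is the change that matters:

containers:
- name: gpushare
  image: k8s-gpushare-plugin:1.11
  command:
  - gpushare-device-plugin-v2   # placeholder for the plugin binary; keep whatever your manifest already uses
  - --memory-unit=GiB           # report each GiB (instead of each MiB) of GPU memory as one virtual device

With GiB granularity the plugin advertises roughly 22 virtual devices per card instead of about 22919, so the ListAndWatch response stays far below the 4 MiB gRPC default.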
@wsxiaozhang I don't really know; is it possible to raise the gRPC message size limit on the plugin's side?
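For context, a minimal grpc-go sketch of where that limit lives. The 4194304 bytes in the kubelet error is the gRPC client's default MaxCallRecvMsgSize, and the kubelet is the client of ListAndWatch, so raising it would mean changing and rebuilding the kubelet's dial code rather than the plugin. The package and helper name below are hypothetical, purely for illustration:

package example

import (
	"net"
	"time"

	"google.golang.org/grpc"
)

// dialWithLargerRecvLimit illustrates the option behind the
// "received message larger than max (... vs. 4194304)" error.
// Only the side that receives the ListAndWatch response (the kubelet)
// can raise its own MaxCallRecvMsgSize; the device plugin cannot lift
// the limit from its side, it can only shrink the message
// (e.g. via --memory-unit=GiB).
func dialWithLargerRecvLimit(socket string) (*grpc.ClientConn, error) {
	return grpc.Dial(socket,
		grpc.WithInsecure(),
		// Dial the device plugin's unix socket, as the kubelet does.
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}),
		// Default is 4 MiB (4194304 bytes); 16 MiB shown only as an example.
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(16*1024*1024)),
	)
}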
1. GPU node status:
[foot@vz01sjzn06 ~]$ kubectl describe node 10.78.189.132
Name: 10.78.189.132
Roles:
Labels: asiainfo.com-gpu.type=gpu
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
gpushare=true
hardware-type=NVIDIAGPU
kubernetes.io/arch=amd64
kubernetes.io/hostname=10.78.189.132
kubernetes.io/os=linux
testywh=testywh
Annotations: node.alpha.kubernetes.io/ttl: 15
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 09 Jul 2021 19:34:08 +0800
Taints:
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
MemoryPressure False Fri, 16 Jul 2021 14:42:05 +0800 Fri, 09 Jul 2021 19:34:08 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 16 Jul 2021 14:42:05 +0800 Fri, 09 Jul 2021 19:34:08 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 16 Jul 2021 14:42:05 +0800 Fri, 09 Jul 2021 19:34:08 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 16 Jul 2021 14:42:05 +0800 Fri, 09 Jul 2021 19:34:15 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.78.189.132
Hostname: 10.78.189.132
Capacity:
aliyun.com/gpu-count: 4
aliyun.com/gpu-mem: 0
cpu: 48
ephemeral-storage: 2339671872Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263714228Ki
nvidia.com/gpu: 0
pods: 110
Allocatable:
aliyun.com/gpu-count: 4
aliyun.com/gpu-mem: 0
cpu: 48
ephemeral-storage: 2156241593666
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263611828Ki
nvidia.com/gpu: 0
pods: 110
System Info:
Machine ID: f2251201691f432a9608dc1896133387
System UUID: B08C2D74-475E-11E8-874C-60DEF3F38020
Boot ID: 5d575ce6-e845-412a-a9cf-b2f3cbddaf9d
Kernel Version: 3.10.0-1062.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.9.8
Kubelet Version: v1.15.6
Kube-Proxy Version: v1.15.6
PodCIDR: 172.30.54.0/24
Non-terminated Pods: (8 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
kube-system dfclient-djk2j 0 (0%) 0 (0%) 0 (0%) 0 (0%) 6d19h
kube-system gpushare-device-plugin-ds-x525n 1 (2%) 1 (2%) 300Mi (0%) 300Mi (0%) 23m
monitoring gpu-metrics-exporter-9b6vh 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d18h
monitoring node-exporter-5gnqg 112m (0%) 270m (0%) 200Mi (0%) 220Mi (0%) 6d19h
monitoring node-gpu-exporter-787m2 100m (0%) 200m (0%) 30Mi (0%) 50Mi (0%) 45h
ns-sqrxlj deploy-zzjtdxbqsfw-6969665b4d-c6f5b 3 (6%) 9 (18%) 8589934592 (3%) 25769803776 (9%) 38h
ns-zqjh-cjr deploy-zqjh-cjr-yzjcdfw-6c455ffc8-szc7q 2 (4%) 2500m (5%) 4294967296 (1%) 5368709120 (1%) 41h
ns-zqjh-cjr deploy-zqjh-cjr-yzjcfw-846fdcd5f-4gfgs 4 (8%) 5 (10%) 8589934592 (3%) 10737418240 (3%) 39h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
cpu 10212m (21%) 17970m (37%)
memory 21010Mi (8%) 40506Mi (15%)
ephemeral-storage 0 (0%) 0 (0%)
aliyun.com/gpu-count 0 0
aliyun.com/gpu-mem 0 0
nvidia.com/gpu 0 0
Events:
2. Device plugin logs:
[foot@vz01sjzn06 ~]$ kubectl logs gpushare-device-plugin-ds-x525n -n kube-system
I0716 06:19:20.093412 1 main.go:18] Start gpushare device plugin
I0716 06:19:20.093495 1 gpumanager.go:28] Loading NVML
I0716 06:19:20.100355 1 gpumanager.go:37] Fetching devices.
I0716 06:19:20.100389 1 gpumanager.go:43] Starting FS watcher.
I0716 06:19:20.100450 1 gpumanager.go:51] Starting OS watcher.
I0716 06:19:20.114808 1 nvidia.go:64] Deivce GPU-7abbb7a9-40e1-053c-f605-e15bb57eae62's Path is /dev/nvidia0
I0716 06:19:20.114869 1 nvidia.go:69] # device Memory: 22919
I0716 06:19:20.114878 1 nvidia.go:40] set gpu memory: 22919
I0716 06:19:20.114885 1 nvidia.go:76] # Add first device ID: GPU-7abbb7a9-40e1-053c-f605-e15bb57eae62--0
I0716 06:19:20.127370 1 nvidia.go:79] # Add last device ID: GPU-7abbb7a9-40e1-053c-f605-e15bb57eae62--22918
I0716 06:19:20.142416 1 nvidia.go:64] Deivce GPU-eaa0c1b3-6adc-a672-bf88-90837d41d729's Path is /dev/nvidia1
I0716 06:19:20.142442 1 nvidia.go:69] # device Memory: 22919
I0716 06:19:20.142451 1 nvidia.go:76] # Add first device ID: GPU-eaa0c1b3-6adc-a672-bf88-90837d41d729--0
I0716 06:19:20.204632 1 nvidia.go:79] # Add last device ID: GPU-eaa0c1b3-6adc-a672-bf88-90837d41d729--22918
I0716 06:19:20.213805 1 nvidia.go:64] Deivce GPU-a9fa46b3-ef3a-795a-f8bc-8963e56fb3d6's Path is /dev/nvidia2
I0716 06:19:20.213827 1 nvidia.go:69] # device Memory: 22919
I0716 06:19:20.213836 1 nvidia.go:76] # Add first device ID: GPU-a9fa46b3-ef3a-795a-f8bc-8963e56fb3d6--0
I0716 06:19:20.222989 1 nvidia.go:79] # Add last device ID: GPU-a9fa46b3-ef3a-795a-f8bc-8963e56fb3d6--22918
I0716 06:19:20.238833 1 nvidia.go:64] Deivce GPU-0340656d-d334-3b85-7d57-97eb4fe17a8d's Path is /dev/nvidia3
I0716 06:19:20.238848 1 nvidia.go:69] # device Memory: 22919
I0716 06:19:20.238854 1 nvidia.go:76] # Add first device ID: GPU-0340656d-d334-3b85-7d57-97eb4fe17a8d--0
I0716 06:19:20.310378 1 nvidia.go:79] # Add last device ID: GPU-0340656d-d334-3b85-7d57-97eb4fe17a8d--22918
I0716 06:19:20.310394 1 server.go:43] Device Map: map[GPU-7abbb7a9-40e1-053c-f605-e15bb57eae62:0 GPU-eaa0c1b3-6adc-a672-bf88-90837d41d729:1 GPU-a9fa46b3-ef3a-795a-f8bc-8963e56fb3d6:2 GPU-0340656d-d334-3b85-7d57-97eb4fe17a8d:3]
I0716 06:19:20.310422 1 server.go:44] Device List: [GPU-7abbb7a9-40e1-053c-f605-e15bb57eae62 GPU-eaa0c1b3-6adc-a672-bf88-90837d41d729 GPU-a9fa46b3-ef3a-795a-f8bc-8963e56fb3d6 GPU-0340656d-d334-3b85-7d57-97eb4fe17a8d]
I0716 06:19:20.342333 1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I0716 06:19:20.342812 1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I0716 06:19:20.343521 1 server.go:230] Registered device plugin with Kubelet
3. Kubelet logs:
kubelet.pc-zjdbai51.root.log.INFO.20210709-193402.2247:I0716 14:19:19.114528 2247 kuberuntime_manager.go:404] No sandbox for pod "gpushare-device-plugin-ds-x525n_kube-system(54ea8b22-93f0-4214-81f0-aa4317f69b40)" can be found. Need to start a new one
kubelet.pc-zjdbai51.root.log.INFO.20210709-193402.2247:I0716 14:19:20.094269 2247 kubelet.go:1933] SyncLoop (PLEG): "gpushare-device-plugin-ds-x525n_kube-system(54ea8b22-93f0-4214-81f0-aa4317f69b40)", event: &pleg.PodLifecycleEvent{ID:"54ea8b22-93f0-4214-81f0-aa4317f69b40", Type:"ContainerStarted", Data:"e8692ddb8e12010bbaed5dba9dc243208d407c7741dcb3d7b21b2137daf2171d"}
kubelet.pc-zjdbai51.root.log.INFO.20210709-193402.2247:I0716 14:19:20.343282 2247 manager.go:363] Got registration request from device plugin with resource name "aliyun.com/gpu-mem"
kubelet.pc-zjdbai51.root.log.INFO.20210709-193402.2247:I0716 14:19:20.343464 2247 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/device-plugins/aliyungpushare.sock 0}]
kubelet.pc-zjdbai51.root.log.INFO.20210709-193402.2247:I0716 14:19:20.343652 2247 manager.go:418] Registered endpoint &{0xc0009a60d8 0xc001c96300 /var/lib/kubelet/device-plugins/aliyungpushare.sock aliyun.com/gpu-mem {0 0 } {0 0} 0x18652a0}
kubelet.pc-zjdbai51.root.log.INFO.20210709-193402.2247:E0716 14:19:20.357750 2247 endpoint.go:106] listAndWatch ended unexpectedly for device plugin aliyun.com/gpu-mem with error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5547796 vs. 4194304)
kubelet.pc-zjdbai51.root.log.INFO.20210709-193402.2247:I0716 14:19:20.357786 2247 manager.go:448] Mark all resources Unhealthy for resource aliyun.com/gpu-mem
kubelet.pc-zjdbai51.root.log.INFO.20210709-193402.2247:I0716 14:19:20.357791 2247 manager.go:432] Endpoint (aliyun.com/gpu-mem, &{0xc0009a60d8 0xc001c96300 /var/lib/kubelet/device-plugins/aliyungpushare.sock aliyun.com/gpu-mem {13849808115795710226 585918067123247 0x7814720} {0 0} 0x18652a0}) became unhealthy
kubelet.pc-zjdbai51.root.log.INFO.20210709-193402.2247:I0716 14:19:21.105598 2247 kubelet.go:1933] SyncLoop (PLEG): "gpushare-device-plugin-ds-x525n_kube-system(54ea8b22-93f0-4214-81f0-aa4317f69b40)", event: &pleg.PodLifecycleEvent{ID:"54ea8b22-93f0-4214-81f0-aa4317f69b40", Type:"ContainerStarted", Data:"31022baa09b663cd5400c6683d29ec42fe919851bcb1f82a345069cbcd430884"}
kubelet.pc-zjdbai51.root.log.WARNING.20210709-193405.2247:E0713 19:04:12.101894 2247 endpoint.go:106] listAndWatch ended unexpectedly for device plugin nvidia.com/gpu with error rpc error: code = Unavailable desc = transport is closing
kubelet.pc-zjdbai51.root.log.WARNING.20210709-193405.2247:E0713 19:04:13.964878 2247 endpoint.go:106] listAndWatch ended unexpectedly for device plugin aliyun.com/gpu-mem with error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5547796 vs. 4194304)
kubelet.pc-zjdbai51.root.log.WARNING.20210709-193405.2247:E0714 16:58:39.756383 2247 endpoint.go:106] listAndWatch ended unexpectedly for device plugin aliyun.com/gpu-mem with error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5547796 vs. 4194304)
kubelet.pc-zjdbai51.root.log.WARNING.20210709-193405.2247:E0716 14:19:20.357750 2247 endpoint.go:106] listAndWatch ended unexpectedly for device plugin aliyun.com/gpu-mem with error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5547796 vs. 4194304)