AliyunContainerService / gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster
Apache License 2.0

ResourceExhausted desc = grpc: received message larger than max (4986010 vs. 4194304) #57

Open k0nstantinv opened 1 year ago

k0nstantinv commented 1 year ago

For a GPU such as the NVIDIA A100 PCIe 80GB, it's not possible to report the extended resource in MiB because of this error:

ResourceExhausted desc = grpc: received message larger than max (4986010 vs. 4194304)

The device plugin can't update the node status, which leaves the GPU node with zero gpu_memory capacity:

Capacity:
  aliyun.com/gpu_memory:         0

Nov 17 15:09:51 node02 kubelet[11218]: I1117 15:09:51.797475   11218 manager.go:440] Mark all resources Unhealthy for resource aliyun.com/gpu_memory
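
A rough back-of-the-envelope check of why the message might exceed the limit. This is only a sketch: it assumes the plugin advertises one device entry per memory unit, and the 60 bytes per entry is a hypothetical figure (device ID string plus health field plus framing), not a measured value. The names `estimateBytes` and `exceedsLimit` are made up for illustration. The 4194304 in the error matches the 4 MiB default maximum receive size in grpc-go.

```go
package main

import "fmt"

// grpcDefaultMaxRecv is the default maximum inbound message size in grpc-go
// (4 MiB); it equals the 4194304 reported in the error.
const grpcDefaultMaxRecv = 4 * 1024 * 1024

// estimateBytes gives a rough size for the advertised device list, assuming
// one device entry per memory unit and a fixed cost per entry.
// Both assumptions are illustrative, not taken from the plugin source.
func estimateBytes(memoryUnits, bytesPerDevice int) int {
	return memoryUnits * bytesPerDevice
}

// exceedsLimit reports whether the estimated list is larger than the
// default gRPC receive limit.
func exceedsLimit(memoryUnits, bytesPerDevice int) bool {
	return estimateBytes(memoryUnits, bytesPerDevice) > grpcDefaultMaxRecv
}

func main() {
	const bytesPerDevice = 60 // hypothetical per-entry size

	cases := []struct {
		unit  string
		units int
	}{
		{"MiB", 80 * 1024}, // assumed: an 80 GiB card counted in MiB
		{"GiB", 80},        // the same card counted in GiB
	}
	for _, c := range cases {
		fmt.Printf("%s: %d entries, ~%d bytes, exceeds limit: %v\n",
			c.unit, c.units,
			estimateBytes(c.units, bytesPerDevice),
			exceedsLimit(c.units, bytesPerDevice))
	}
}
```

Under these assumptions the MiB case lands in the same range as the 4986010 bytes from the error, while counting in a coarser unit such as GiB stays far below the limit.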