Open k0nstantinv opened 1 year ago
For a GPU such as the NVIDIA A100 PCI-E 80GB, the extended resource can't be updated when memory is reported in Mb, because of this error:
ResourceExhausted desc = grpc: received message larger than max (4986010 vs. 4194304)
The device plugin can't update the node status, so the GPU node ends up with zero gpu_memory capacity:
Capacity: aliyun.com/gpu_memory: 0
Nov 17 15:09:51 node02 kubelet[11218]: I1117 15:09:51.797475 11218 manager.go:440] Mark all resources Unhealthy for resource aliyun.com/gpu_memory