AliyunContainerService / gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster
Apache License 2.0
468 stars 144 forks

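A large number of pods in the cluster can cause high load and a cluster-wide avalanche; in addition, using MiB as the unit can cause the kubelet gRPC call to fail #39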

Closed qmloong closed 3 years ago

qmloong commented 3 years ago
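  1. podmanager lists every pod in the cluster. When the cluster holds a very large number of pods (more than 20,000) and many GPU-consuming pods are scaled up at once (tested scaling from 0 to 1000), this pushes list requests against the apiserver above 10 QPS and triggers a cluster-wide avalanche.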

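  2. When the memory unit is MiB and the device's gpumem is 124GB, the count is expressed in MiB, so there are 12400 fake device IDs. Testing showed that kubelet's listAndWatch gRPC call then returns an error. Changing the string concatenation used to name the fake devices mitigates the problem.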

Jun 24 18:55:09 10-12-3-162 kubelet[350652]: E0624 18:55:09.869624  350652 endpoint.go:106] listAndWatch ended unexpectedly for device plugin aliyun.com/gpu-mem with error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (7880680 vs. 4194304)
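The 4194304 in the log is gRPC's default 4 MiB receive limit. The following rough sketch estimates how the length of the fake device ID drives the size of the ListAndWatch response. It rests on several assumptions: one fake device is advertised per MiB (124 GiB would then be 126,976 devices), the memory is spread over 8 GPUs, each device is encoded as a protobuf message with two string fields (ID and health), and the long ID is a 40-character GPU UUID followed by `-_-` and a counter. `longID` and `shortID` are hypothetical naming schemes, not necessarily the plugin's.

```go
package main

import "fmt"

// grpcDefaultMaxRecv is gRPC's default receive limit (4 MiB), as seen in the log.
const grpcDefaultMaxRecv = 4 * 1024 * 1024

// varintLen returns how many bytes protobuf needs to encode n as a varint.
func varintLen(n int) int {
	l := 1
	for n >= 0x80 {
		n >>= 7
		l++
	}
	return l
}

// stringFieldLen is the encoded size of a string field with a field number
// below 16: one tag byte, a varint length prefix, and the payload.
func stringFieldLen(s string) int {
	return 1 + varintLen(len(s)) + len(s)
}

// deviceLen is the encoded size of one device entry (ID and health strings)
// nested inside a repeated message field.
func deviceLen(id, health string) int {
	inner := stringFieldLen(id) + stringFieldLen(health)
	return 1 + varintLen(inner) + inner
}

// responseLen sums the encoded size of every fake device in one response.
func responseLen(gpus, mibPerGPU int, makeID func(gpu, unit int) string) int {
	total := 0
	for g := 0; g < gpus; g++ {
		for u := 0; u < mibPerGPU; u++ {
			total += deviceLen(makeID(g, u), "Healthy")
		}
	}
	return total
}

// longID mimics an assumed scheme: 40-character GPU UUID + "-_-" + counter.
func longID(gpu, unit int) string {
	return fmt.Sprintf("GPU-00000000-0000-0000-0000-%012d-_-%d", gpu, unit)
}

// shortID is a hypothetical compact scheme: GPU index + "-" + counter.
func shortID(gpu, unit int) string {
	return fmt.Sprintf("%d-%d", gpu, unit)
}

func main() {
	const gpus, mibPerGPU = 8, 124 * 1024 / 8 // assumed layout of 124 GiB
	for name, f := range map[string]func(int, int) string{"long": longID, "short": shortID} {
		size := responseLen(gpus, mibPerGPU, f)
		fmt.Printf("%s IDs: %d bytes, fits in 4 MiB: %v\n", name, size, size <= grpcDefaultMaxRecv)
	}
}
```

Under these assumptions the long IDs land in the same range as the 7880680 bytes reported by kubelet, while the compact IDs stay below the limit. Shortening the ID only buys headroom: the message still grows linearly with the number of fake devices, so a coarser unit such as GiB reduces the size far more.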