AliyunContainerService / gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster
Apache License 2.0

Fix the excessive all-namespaces pod listing, and the bug where, with MiB units, device name length can make gRPC calls fail and reset the node's gpumem resource to 0 #40

Closed: qmloong closed this 3 years ago

qmloong commented 3 years ago
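  1. Instead of requesting the pending pods on the current node from the apiserver, query the kubelet's /pods endpoint, retrying up to 8 times with a 0.1s backoff between attempts. Only if that still fails does it fall back to listing from the apiserver, up to 3 times at 1s intervals.
  2. Change the FakeDeviceID naming logic, scaling it up 3x from the previous 6000 MiB.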
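Point 2 ties name length to gRPC failures. One plausible reading, which the PR does not spell out, is that every MiB becomes its own fake device in the ListAndWatch response. The total ID bytes therefore grow with (number of GPUs × MiB per GPU × ID length) and can exceed the gRPC receive limit (grpc-go defaults to 4 MiB). The sketch below only illustrates that scaling. `legacyFakeID` assumes a `<realID>-_-<counter>` format, and `compactFakeID` is a hypothetical shorter scheme, not the one the PR adopted.

```go
package main

import (
	"fmt"
	"strconv"
)

// legacyFakeID builds a "<realID>-_-<counter>" style name. This format is an
// assumption about the pre-PR naming, used only to illustrate length.
func legacyFakeID(realID string, counter int) string {
	return fmt.Sprintf("%s-_-%d", realID, counter)
}

// compactFakeID is a hypothetical shorter scheme: GPU index plus a base-36
// counter. It stays unique per node and still maps back to its GPU.
func compactFakeID(gpuIndex, counter int) string {
	return strconv.Itoa(gpuIndex) + "-" + strconv.FormatInt(int64(counter), 36)
}

// totalIDBytes sums the lengths of all fake device IDs advertised for numGPUs
// GPUs with memMiB one-MiB units each; name builds the ID for (gpu, unit).
func totalIDBytes(numGPUs, memMiB int, name func(gpu, unit int) string) int {
	total := 0
	for g := 0; g < numGPUs; g++ {
		for u := 0; u < memMiB; u++ {
			total += len(name(g, u))
		}
	}
	return total
}
```

Whatever scheme is used, the fake IDs must remain unique on the node, and the plugin must still be able to tell which physical GPU each one belongs to.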
qmloong commented 3 years ago

Hi @qmloong, nice work! Could you split this PR into two? Let's fix the list pods issue first.

OK, I'll close this issue then.