AliyunContainerService / gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster
Apache License 2.0

GPU memory is oversubscribed after a node restart #35

Open zlingqu opened 3 years ago

zlingqu commented 3 years ago

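After I restarted a GPU node and then deployed a few more services, I found that GPU memory on some cards was oversubscribed. The output looks like this: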

[root@jenkins app-deploy-platform]# kubectl-inspect-gpushare 
NAME           IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU4(Allocated/Total)  GPU5(Allocated/Total)  GPU6(Allocated/Total)  GPU7(Allocated/Total)  GPU Memory(GiB)
192.168.3.4    192.168.3.4    18/11                  8/11                   9/11                   11/11                  17/11                  8/11                   8/11                   4/11                   83/88
192.168.68.4   192.168.68.4   14/10                  10/10                  6/10                   14/10                  10/10                  10/10                  9/10                   0/10                   73/80
192.168.68.68  192.168.68.68  9/10                   8/10                   4/10                   0/10                   0/10                   0/10                   0/10                   0/10                   21/80
---------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
177/248 (71%)  

I think this is a bug in the plugin itself.
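
To make the problem easier to spot, here is a minimal sketch that scans the text output of `kubectl-inspect-gpushare` and lists every GPU whose allocated memory exceeds its total. It assumes the column layout shown above (name, IP address, one `Allocated/Total` cell per GPU, then the node total); the function name `find_oversubscribed` and the `SAMPLE` constant are hypothetical and are not part of the plugin. Note that the per-node totals (83/88, 73/80) stay under capacity, so only the per-GPU cells reveal the oversubscription.

```python
import re

# Data rows copied from the report above (whitespace condensed).
SAMPLE = """\
NAME           IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(GiB)
192.168.3.4    192.168.3.4    18/11 8/11 9/11 11/11 17/11 8/11 8/11 4/11 83/88
192.168.68.4   192.168.68.4   14/10 10/10 6/10 14/10 10/10 10/10 9/10 0/10 73/80
192.168.68.68  192.168.68.68  9/10 8/10 4/10 0/10 0/10 0/10 0/10 0/10 21/80
Allocated/Total GPU Memory In Cluster:
177/248 (71%)
"""


def find_oversubscribed(report):
    """Return (node, gpu_index, allocated, total) for each GPU with allocated > total."""
    hits = []
    for line in report.splitlines():
        fields = line.split()
        # A data row is: name, IP, then only Allocated/Total cells.
        # Header, separator, and summary lines fail this check and are skipped.
        if len(fields) < 4:
            continue
        if not all(re.fullmatch(r"\d+/\d+", f) for f in fields[2:]):
            continue
        node = fields[0]
        # The last cell is the node-wide total, so it is excluded here.
        for gpu_index, cell in enumerate(fields[2:-1]):
            allocated, total = map(int, cell.split("/"))
            if allocated > total:
                hits.append((node, gpu_index, allocated, total))
    return hits


if __name__ == "__main__":
    for node, gpu_index, allocated, total in find_oversubscribed(SAMPLE):
        print(f"{node} GPU{gpu_index}: {allocated}/{total} GiB allocated")
```

In practice the report text could be piped in from the command itself instead of using the embedded sample.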