AliyunContainerService / gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster
Apache License 2.0
468 stars 144 forks

device plugin failed to detect gpu info correctly #17

Closed pan87232494 closed 4 years ago

pan87232494 commented 5 years ago

Description


  kubectl inspect gpushare
  NAME             IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(GiB)
  k8s-demo-slave2  192.168.2.140  0/1                    0/1                    0/2
  --------------------------------------------------------------
  Allocated/Total GPU Memory In Cluster:
  0/2 (0%)  

This host actually has two GPUs, so the reported numbers look wrong. Is the GTX 1080 Ti not supported?

```bash
nvidia-smi 
Thu Oct 10 15:03:38 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960     Off  | 00000000:17:00.0 Off |                  N/A |
| 36%   29C    P8     7W / 120W |      0MiB /  2002MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:66:00.0 Off |                  N/A |
| 14%   37C    P8    25W / 270W |      0MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

```
juchaosong commented 4 years ago

The GPU count (GPU0 and GPU1) is correct.

pan87232494 commented 4 years ago

> The GPU count (GPU0 and GPU1) is correct.

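But the amount of usable GPU memory is wrong, isn't it? Is mixing different GPU models in one node currently unsupported? For example, I have two machines: 140 has a GTX 960 plus a 1080 Ti, and 229 has two 1080 Ti cards. The second machine shows the correct available GPU memory, but the mixed machine shows only the first card's information.

  kubectl inspect gpushare
  NAME           IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(GiB)
  192.168.2.140  192.168.2.140  0/1                    0/1                    0/2
  192.168.2.229  192.168.2.229  0/10                   0/10                   0/20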

pan87232494 commented 4 years ago

> The GPU count (GPU0 and GPU1) is correct.

Also, shouldn't the 1080 Ti have 11 GB? It shows 10 GB here.

pan87232494 commented 4 years ago

I removed the 960, and the GPU information is now displayed correctly.

  kubectl inspect gpushare
  NAME           IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(GiB)
  192.168.2.140  192.168.2.140  0/10                   0/0                    0/10
  192.168.2.229  192.168.2.229  10/10                  10/10                  20/20

juchaosong commented 4 years ago

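> The GPU count (GPU0 and GPU1) is correct.
>
> But the amount of usable GPU memory is wrong, isn't it? Is mixing different GPU models in one node currently unsupported? For example, I have two machines: 140 has a GTX 960 plus a 1080 Ti, and 229 has two 1080 Ti cards. The second machine shows the correct available GPU memory, but the mixed machine shows only the first card's information.
>
>   kubectl inspect gpushare
>   NAME           IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(GiB)
>   192.168.2.140  192.168.2.140  0/1                    0/1                    0/2
>   192.168.2.229  192.168.2.229  0/10                   0/10                   0/20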

Judging from the code at https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/pkg/gpu/nvidia/nvidia.go#L70, mixing different GPU models in one node is currently not supported.
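
The symptom reported above can be sketched in Go. This is a minimal illustration, not the plugin's actual code: it assumes the plugin records the memory size of the first GPU it enumerates and advertises that size for every GPU on the node. The function names are hypothetical, and the whole-GiB sizes (1 for the GTX 960, 10 for the GTX 1080 Ti) are taken from the `kubectl inspect gpushare` output in this issue.

```go
package main

import "fmt"

// advertisedUniform mimics the ASSUMED behavior: the memory size of the
// first device is reused for every device on the node.
func advertisedUniform(memGiB []uint) []uint {
	out := make([]uint, len(memGiB))
	if len(memGiB) == 0 {
		return out
	}
	for i := range out {
		out[i] = memGiB[0]
	}
	return out
}

// advertisedPerDevice is what a mixed-model node would need instead:
// each device reports its own memory size.
func advertisedPerDevice(memGiB []uint) []uint {
	out := make([]uint, len(memGiB))
	copy(out, memGiB)
	return out
}

// totalGiB sums the advertised memory across all devices.
func totalGiB(memGiB []uint) uint {
	var t uint
	for _, m := range memGiB {
		t += m
	}
	return t
}

func main() {
	// Node 192.168.2.140: GTX 960 (1 GiB) and GTX 1080 Ti (10 GiB).
	node140 := []uint{1, 10}

	uniform := advertisedUniform(node140)
	fmt.Println(uniform, totalGiB(uniform)) // [1 1] 2

	perDevice := advertisedPerDevice(node140)
	fmt.Println(perDevice, totalGiB(perDevice)) // [1 10] 11
}
```

Under that assumption the uniform variant reproduces the `0/1 0/1 0/2` row seen for the mixed node, while a node with two identical cards is unaffected.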

juchaosong commented 4 years ago

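> The GPU count (GPU0 and GPU1) is correct.
>
> Also, shouldn't the 1080 Ti have 11 GB? It shows 10 GB here.

GPU information is retrieved through the https://github.com/NVIDIA/gpu-monitoring-tools repository.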
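
The 10 versus 11 discrepancy is consistent with a MiB-to-GiB conversion that truncates. The sketch below assumes an integer division by 1024; whether the plugin or the library performs exactly this conversion is an assumption, and `mibToGiBTruncated` is a hypothetical name. The MiB values come from the `nvidia-smi` output in this issue.

```go
package main

import "fmt"

// mibToGiBTruncated converts MiB to whole GiB using integer division,
// which discards any fractional part.
func mibToGiBTruncated(mib uint) uint {
	return mib / 1024
}

func main() {
	// Total memory in MiB as reported by nvidia-smi.
	fmt.Println(mibToGiBTruncated(2002))  // GTX 960: prints 1
	fmt.Println(mibToGiBTruncated(11175)) // GTX 1080 Ti: prints 10
}
```

11175 MiB is about 10.9 GiB, so any conversion that rounds down reports 10 rather than 11.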

pan87232494 commented 4 years ago

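> The GPU count (GPU0 and GPU1) is correct.
>
> Also, shouldn't the 1080 Ti have 11 GB? It shows 10 GB here.
>
> GPU information is retrieved through the https://github.com/NVIDIA/gpu-monitoring-tools repository.

Got it. Thanks a lot :D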