AliyunContainerService / gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster
Apache License 2.0

No Devices found. Waiting indefinitely. #49

Open clennpillo opened 2 years ago

clennpillo commented 2 years ago

Plugin cannot find my A100 80G

I used Rancher 2.5.9 to build my cluster. I believe the installation steps are correct, since the same steps worked on another cluster that uses an A100 40G. However, the plugin fails on this cluster, which uses an A100 80G.

nvidia-smi gives the correct result.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:00:08.0 Off |                    0 |
| N/A   39C    P0    60W / 300W |      0MiB / 80994MiB |     14%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, no GPU shows up in the cluster:

kubectl describe node

Allocatable:
  cpu:                2
  ephemeral-storage:  48294789041
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3777904Ki
  pods:               110
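
For anyone who wants to script this check, here is a minimal Go sketch that parses the Allocatable block from `kubectl describe node` output and looks for a GPU memory resource. It assumes the plugin advertises GPU memory under the resource name `aliyun.com/gpu-mem`; that name is not shown anywhere in this thread, so adjust it if your deployment differs. The function names are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// gpuMemResource is an assumption: the resource name the gpushare
// plugin is expected to register. It does not appear in this thread.
const gpuMemResource = "aliyun.com/gpu-mem"

// allocatable parses the "Allocatable:" block of `kubectl describe node`
// output into a map of resource name to quantity string.
func allocatable(describe string) map[string]string {
	res := map[string]string{}
	inBlock := false
	sc := bufio.NewScanner(strings.NewReader(describe))
	for sc.Scan() {
		line := sc.Text()
		trimmed := strings.TrimSpace(line)
		if trimmed == "Allocatable:" {
			inBlock = true
			continue
		}
		if !inBlock {
			continue
		}
		// The block ends at the first blank or non-indented line.
		if trimmed == "" || !strings.HasPrefix(line, " ") {
			break
		}
		parts := strings.SplitN(trimmed, ":", 2)
		if len(parts) == 2 {
			res[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
		}
	}
	return res
}

// hasGPUMem reports whether the node advertises a non-zero amount
// of the assumed GPU memory resource.
func hasGPUMem(res map[string]string) bool {
	v, ok := res[gpuMemResource]
	return ok && v != "0"
}

func main() {
	// Sample input copied from the output above.
	sample := `
Allocatable:
  cpu:                2
  ephemeral-storage:  48294789041
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3777904Ki
  pods:               110
`
	fmt.Println("GPU memory advertised:", hasGPUMem(allocatable(sample)))
}
```

Run against the output above, the check reports that no GPU memory is advertised, which matches what the node shows.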

I tried to find the reason. This is the log from the plugin's Pod:

[root@data1 ~]# docker logs 6e8823f03d54
I0114 15:37:53.669065       1 main.go:18] Start gpushare device plugin
I0114 15:37:53.669146       1 gpumanager.go:28] Loading NVML
I0114 15:37:53.743358       1 gpumanager.go:37] Fetching devices.
I0114 15:37:53.743407       1 gpumanager.go:39] No devices found. Waiting indefinitely.
[root@data1 ~]#

Any idea how this happens? Is it possible that the plugin does not support the A100 80G?

Lanyujiex commented 2 years ago

You can try upgrading nvidia-docker and then try again.
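
A related thing that may be worth checking after the upgrade: the plugin container can only see the GPU through NVML if Docker starts it with the NVIDIA runtime. This is an assumption about the cause, not something confirmed in this thread. The Go sketch below checks a Docker `daemon.json` document for a registered `nvidia` runtime that is also set as `default-runtime`. The function name and the sample configuration are hypothetical; in practice you would read the file from your node (commonly `/etc/docker/daemon.json`) instead of using the inline sample.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// daemonConfig models only the two daemon.json keys this check needs.
type daemonConfig struct {
	DefaultRuntime string `json:"default-runtime"`
	Runtimes       map[string]struct {
		Path string `json:"path"`
	} `json:"runtimes"`
}

// nvidiaIsDefault reports whether the config registers a runtime named
// "nvidia" and also selects it as the default runtime.
func nvidiaIsDefault(raw []byte) (bool, error) {
	var cfg daemonConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return false, err
	}
	_, registered := cfg.Runtimes["nvidia"]
	return registered && cfg.DefaultRuntime == "nvidia", nil
}

func main() {
	// Hypothetical sample; replace with the contents of your own daemon.json.
	sample := []byte(`{
	  "default-runtime": "nvidia",
	  "runtimes": {
	    "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
	  }
	}`)
	ok, err := nvidiaIsDefault(sample)
	if err != nil {
		fmt.Println("could not parse config:", err)
		return
	}
	fmt.Println("nvidia is the default runtime:", ok)
}
```

If the check comes back false on the affected node, restarting Docker after fixing the configuration and then recreating the plugin Pod would be the next thing to try.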