NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Cannot distinguish T4 and A100? #378

Open ggjjlldd opened 1 year ago

ggjjlldd commented 1 year ago

1. Issue or feature description

My machine has 5 NVIDIA GPUs, a mix of T4 and A100 cards, as shown below. The device plugin cannot distinguish the T4 from the A100s: every card is advertised as nvidia.com/gpu. I want the T4 to be advertised as nvidia.com/t4 and the A100s as nvidia.com/a100. How can I do this?

Capacity:
  cpu:                            64
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              459403376Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         263855916Ki
  nvidia.com/gpu:                 4
  nvidia.com/hostdev:             0
  nvidia.com/mig-1g.5gb:          7
  nvidia.com/mig-2g.10gb:         0
  nvidia.com/mig-4g.20gb:         0
  pods:                           110
root@k8s-gpuworker01:/var/lib/kubelet/device-plugins# nvidia-smi
Tue Feb  7 14:19:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:1A:00.0 Off |                    0 |
| N/A   32C    P8    14W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:3D:00.0 Off |                   On |
| N/A   32C    P0    66W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   30C    P0    54W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   28C    P0    40W / 250W |  40229MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCI...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   28C    P0    61W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  1    7   0   0  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    8   0   1  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    9   0   2  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   11   0   3  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   12   0   4  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   13   0   5  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   14   0   6  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    3   N/A  N/A   3135236      C   ...conda3/envs/dl/bin/python    40227MiB |
+-----------------------------------------------------------------------------+
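To make the request concrete, here is a minimal Python sketch of the mapping being asked for. It is not the device plugin's code or configuration. The GPU list is copied from the nvidia-smi output above, with the truncated A100 name kept as printed. The substring-to-resource table and the function names are hypothetical, chosen only for illustration. The sketch assumes a MIG-enabled GPU is not counted as a full GPU, which is consistent with the Capacity output (nvidia.com/gpu: 4 for five physical cards).

```python
from collections import Counter

# (name, mig_enabled) for each GPU, as shown in the nvidia-smi output above.
# The A100 name is truncated in that output and is kept as-is here.
GPUS = [
    ("Tesla T4", False),
    ("NVIDIA A100-PCI...", True),
    ("NVIDIA A100-PCI...", False),
    ("NVIDIA A100-PCI...", False),
    ("NVIDIA A100-PCI...", False),
]

# Hypothetical mapping from a substring of the GPU name to the resource
# name requested in this issue. Not an actual plugin setting.
MODEL_RESOURCES = {
    "T4": "nvidia.com/t4",
    "A100": "nvidia.com/a100",
}

DEFAULT_RESOURCE = "nvidia.com/gpu"


def resource_for(name):
    """Return the per-model resource name, or the generic one if unmatched."""
    for key, resource in MODEL_RESOURCES.items():
        if key in name:
            return resource
    return DEFAULT_RESOURCE


def current_capacity(gpus):
    """Observed behavior: every non-MIG GPU is counted as nvidia.com/gpu."""
    return Counter(DEFAULT_RESOURCE for _, mig in gpus if not mig)


def desired_capacity(gpus):
    """Requested behavior: non-MIG GPUs are counted per model."""
    return Counter(resource_for(name) for name, mig in gpus if not mig)


if __name__ == "__main__":
    print("current:", dict(current_capacity(GPUS)))
    print("desired:", dict(desired_capacity(GPUS)))
```

With the five cards above, the current scheme yields a single nvidia.com/gpu count of 4, while the requested scheme would split that into one nvidia.com/t4 and three nvidia.com/a100. The MIG slices are left out of both counts because they are already advertised under their own nvidia.com/mig-* names.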

2. Steps to reproduce the issue

Install the k8s device plugin, then run kubectl describe node.

klueska commented 1 year ago

We built a feature last summer to do exactly what you describe. It is feature-complete but currently disabled in the plugin, awaiting approval from our product team. It is unclear when, or if, it will ever be approved.

Here is a description of the feature: https://docs.google.com/document/d/1dL67t9IqKC2-xqonMi6DV7W2YNZdkmfX7ibB6Jb-qmk/edit

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

mkarami2024 commented 4 months ago

I also really need this feature!