NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Wrong node capacity and allocatable when using MIG #637

Open xhejtman opened 11 months ago

xhejtman commented 11 months ago

1. Quick Debug Information

2. Issue or feature description

When MIG is enabled, both MIG resource and nvidia.com/gpu resource are reported as allocatable:

Allocatable:
  cerit.io/gpu-count:      2
  cerit.io/gpu-mem:        0
  cpu:                     64
  ephemeral-storage:       7104643354787
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  519659388Ki
  nvidia.com/gpu:          2
  nvidia.com/mig-1g.10gb:  6
  nvidia.com/mig-2g.20gb:  4
  nvidia.com/mig-3g.40gb:  0
  pods:                    160

This means that requests for both nvidia.com/gpu and nvidia.com/mig-1g.10gb can be scheduled onto the node; however, a pod requesting nvidia.com/gpu does not get a GPU injected.
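
A quick way to spot this state from `kubectl describe node` output is to check whether nvidia.com/gpu and any nvidia.com/mig-* resource are non-zero at the same time. The following is only a sketch: the function names are hypothetical, and whether the overlap is actually wrong depends on whether any GPU on the node has MIG disabled.

```python
def parse_allocatable(text):
    """Parse '<name>: <value>' rows from a pasted Allocatable block."""
    resources = {}
    for line in text.splitlines():
        parts = line.split()
        # Resource rows have exactly two fields; skip the header and blanks.
        if len(parts) != 2 or not parts[0].endswith(":"):
            continue
        resources[parts[0][:-1]] = parts[1]
    return resources


def overlapping_gpu_resources(resources):
    """Return (full_gpu_count, mig_counts) if both kinds are non-zero, else None."""
    full = int(resources.get("nvidia.com/gpu", "0"))
    mig = {
        name: int(value)
        for name, value in resources.items()
        if name.startswith("nvidia.com/mig-") and int(value) > 0
    }
    return (full, mig) if full > 0 and mig else None


node_text = """Allocatable:
  nvidia.com/gpu:          2
  nvidia.com/mig-1g.10gb:  6
  nvidia.com/mig-2g.20gb:  4
  nvidia.com/mig-3g.40gb:  0
"""
print(overlapping_gpu_resources(parse_allocatable(node_text)))
```

For the values reported above, the check returns a non-None result, because 2 full GPUs are advertised alongside 6 and 4 MIG devices.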

3. Steps to reproduce the issue

Enable MIG on A100 GPU.

This may be just a bug in Kubernetes, not the gpu operator itself.

shivamerla commented 11 months ago

@xhejtman this is controlled by the mig.strategy: mixed parameter. When mixed strategy is used the device-plugin will

So in your case, it looks like some GPUs have MIG disabled and others have it enabled. Is that correct? Otherwise this would be a bug.

xhejtman commented 11 months ago

I have both GPUs set to MIG mode:

Thu Dec 21 00:55:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:27:00.0 Off |                   On |
| N/A   50C    P0              83W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:A3:00.0 Off |                   On |
| N/A   48C    P0              81W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   1  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   13   0   4  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    3   0   0  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    5   0   1  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    9   0   2  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   10   0   3  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   13   0   4  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

shivamerla commented 11 months ago

Ah, this seems to be a bug then. Will look into this. cc @elezar @klueska

elezar commented 10 months ago

@xhejtman could you provide the logs from the device plugin?

xhejtman commented 10 months ago

2.log

In the meantime, I checked that Kubernetes 1.27.8 is not the problem: I have a different cluster with the 23.6.1 operator and it works OK.

elezar commented 10 months ago

Looking at the logs, we're only starting 2 GRPC servers:

2023-12-18T12:40:20.600590354+01:00 stderr F I1218 11:40:20.600444       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-1g.10gb'
2023-12-18T12:40:20.601080041+01:00 stderr F I1218 11:40:20.600967       1 server.go:117] Starting to serve 'nvidia.com/mig-1g.10gb' on /var/lib/kubelet/device-plugins/nvidia-mig-1g.10gb.sock
2023-12-18T12:40:20.633441912+01:00 stderr F I1218 11:40:20.632289       1 server.go:125] Registered device plugin for 'nvidia.com/mig-1g.10gb' with Kubelet
2023-12-18T12:40:20.633473571+01:00 stderr F I1218 11:40:20.632494       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-2g.20gb'
2023-12-18T12:40:20.633492757+01:00 stderr F I1218 11:40:20.632946       1 server.go:117] Starting to serve 'nvidia.com/mig-2g.20gb' on /var/lib/kubelet/device-plugins/nvidia-mig-2g.20gb.sock
2023-12-18T12:40:20.649231279+01:00 stderr F I1218 11:40:20.644793       1 server.go:125] Registered device plugin for 'nvidia.com/mig-2g.20gb' with Kubelet

meaning that the running instance of the plugin should only be exposing these as allocatable resources.
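
For longer logs, the set of registered resources can be pulled out mechanically. This is a minimal sketch that assumes the log wording matches the lines quoted above; `registered_resources` is a hypothetical helper, not part of the device plugin.

```python
import re

# Matches the "Registered device plugin for '<resource>' with Kubelet" lines.
REGISTERED = re.compile(r"Registered device plugin for '([^']+)' with Kubelet")


def registered_resources(log_text):
    """Return the sorted, de-duplicated resource names the plugin registered."""
    return sorted(set(REGISTERED.findall(log_text)))


log_text = (
    "I1218 11:40:20.632289       1 server.go:125] "
    "Registered device plugin for 'nvidia.com/mig-1g.10gb' with Kubelet\n"
    "I1218 11:40:20.644793       1 server.go:125] "
    "Registered device plugin for 'nvidia.com/mig-2g.20gb' with Kubelet\n"
)
print(registered_resources(log_text))
```

If nvidia.com/gpu does not appear in the result, the running plugin instance did not register it, which matches the reading of the logs above.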

Could you confirm that /var/lib/kubelet/device-plugins/ only references these two resource types? It could be that when applying the MIG config update the other socket was not removed.

xhejtman commented 10 months ago

root@kub-as6:/var/lib/kubelet/device-plugins# ls -1
kubelet.sock
kubelet_internal_checkpoint
nvidia-mig-1g.10gb.sock
nvidia-mig-2g.20gb.sock
root@kub-as6:/var/lib/kubelet/device-plugins#
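
The directory listing can be cross-checked against the node's allocatable resources to find any nvidia.com/* resource that is advertised without a matching socket. This is a rough sketch under one assumption: that a socket named `nvidia-<suffix>.sock` serves the resource `nvidia.com/<suffix>`, which is inferred from the log lines in this thread rather than from the plugin's documentation. Both helper names are hypothetical.

```python
def resource_from_socket(filename):
    """Map 'nvidia-<suffix>.sock' to 'nvidia.com/<suffix>' (assumed naming)."""
    prefix, suffix = "nvidia-", ".sock"
    if filename.startswith(prefix) and filename.endswith(suffix):
        return "nvidia.com/" + filename[len(prefix):-len(suffix)]
    return None  # e.g. kubelet.sock or the checkpoint file


def unbacked_resources(allocatable, socket_files):
    """Return non-zero nvidia.com/* resources that have no matching socket."""
    served = {r for r in map(resource_from_socket, socket_files) if r}
    return sorted(
        name
        for name, count in allocatable.items()
        if name.startswith("nvidia.com/") and count > 0 and name not in served
    )


allocatable = {
    "nvidia.com/gpu": 2,
    "nvidia.com/mig-1g.10gb": 6,
    "nvidia.com/mig-2g.20gb": 4,
    "nvidia.com/mig-3g.40gb": 0,
}
socket_files = [
    "kubelet.sock",
    "kubelet_internal_checkpoint",
    "nvidia-mig-1g.10gb.sock",
    "nvidia-mig-2g.20gb.sock",
]
print(unbacked_resources(allocatable, socket_files))
```

With the values from this thread, only nvidia.com/gpu is reported: it is allocatable on the node but has no socket in the directory. The sketch does not say why that is the case.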