xhejtman opened 11 months ago
@xhejtman this is controlled by the mig.strategy: mixed
parameter. When the mixed strategy is used, the device plugin exposes each MIG profile as its own named resource (e.g. nvidia.com/mig-1g.10gb) and only advertises plain nvidia.com/gpu for GPUs that have MIG disabled.
So in your case, you do seem to have some GPUs with MIG disabled and others with it enabled. Is that correct? Otherwise this would be a bug.
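With the mixed strategy, each MIG profile shows up as its own extended resource on the node. As a quick sanity check you can inspect what the node actually advertises; the sketch below runs over an assumed sample of a node's allocatable map (in practice you would fetch the real map with kubectl get node <name> -o jsonpath='{.status.allocatable}'):

```shell
# Assumed sample of a node's .status.allocatable map (values are illustrative).
allocatable='{"cpu":"64","nvidia.com/gpu":"2","nvidia.com/mig-1g.10gb":"6","nvidia.com/mig-2g.20gb":"4"}'

# Split the JSON map into one entry per line and keep only the GPU resources.
# With mig.strategy=mixed and MIG enabled on every GPU, only mig-* names should
# remain; a plain nvidia.com/gpu entry alongside them hints at the reported bug.
echo "$allocatable" | tr ',{}' '\n\n\n' | grep 'nvidia.com' | tr -d '"'
```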
I have both GPUs set to a MIG configuration:
Thu Dec 21 00:55:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:27:00.0 Off | On |
| N/A 50C P0 83W / 300W | 38MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000000:A3:00.0 Off | On |
| N/A 48C P0 81W / 300W | 38MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 3 0 0 | 10MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 5 0 1 | 10MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 9 0 2 | 5MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 10 0 3 | 5MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 13 0 4 | 5MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 3 0 0 | 10MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 5 0 1 | 10MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 9 0 2 | 5MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 10 0 3 | 5MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 13 0 4 | 5MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Ah, this seems to be a bug then. Will look into this. cc @elezar @klueska
@xhejtman could you provide the logs from the device plugin?
In the meantime, I checked that Kubernetes 1.27.8 is not the problem; I have a different cluster with the 23.6.1 operator and it works OK.
Looking at the logs, we're only starting two gRPC servers:
2023-12-18T12:40:20.600590354+01:00 stderr F I1218 11:40:20.600444 1 server.go:165] Starting GRPC server for 'nvidia.com/mig-1g.10gb'
2023-12-18T12:40:20.601080041+01:00 stderr F I1218 11:40:20.600967 1 server.go:117] Starting to serve 'nvidia.com/mig-1g.10gb' on /var/lib/kubelet/device-plugins/nvidia-mig-1g.10gb.sock
2023-12-18T12:40:20.633441912+01:00 stderr F I1218 11:40:20.632289 1 server.go:125] Registered device plugin for 'nvidia.com/mig-1g.10gb' with Kubelet
2023-12-18T12:40:20.633473571+01:00 stderr F I1218 11:40:20.632494 1 server.go:165] Starting GRPC server for 'nvidia.com/mig-2g.20gb'
2023-12-18T12:40:20.633492757+01:00 stderr F I1218 11:40:20.632946 1 server.go:117] Starting to serve 'nvidia.com/mig-2g.20gb' on /var/lib/kubelet/device-plugins/nvidia-mig-2g.20gb.sock
2023-12-18T12:40:20.649231279+01:00 stderr F I1218 11:40:20.644793 1 server.go:125] Registered device plugin for 'nvidia.com/mig-2g.20gb' with Kubelet
meaning that the running instance of the plugin should only be exposing these as allocatable resources.
Could you confirm that /var/lib/kubelet/device-plugins/
only contains sockets for these two resource types? It could be that the other socket was not removed when the MIG config update was applied.
root@kub-as6:/var/lib/kubelet/device-plugins# ls -1
kubelet.sock
kubelet_internal_checkpoint
nvidia-mig-1g.10gb.sock
nvidia-mig-2g.20gb.sock
root@kub-as6:/var/lib/kubelet/device-plugins#
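The socket names map one-to-one to the resource types the plugin serves, so this listing matches the two gRPC servers in the log. A small sketch of that correspondence (socket names taken from the listing above; the naming rule is illustrative):

```shell
# Each device-plugin socket under /var/lib/kubelet/device-plugins corresponds
# to one extended resource: strip the "nvidia-" prefix and ".sock" suffix.
resources=""
for sock in nvidia-mig-1g.10gb.sock nvidia-mig-2g.20gb.sock; do
  name="${sock#nvidia-}"                       # drop the vendor prefix
  resources="$resources nvidia.com/${name%.sock}"
done
echo "$resources"
```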
1. Quick Debug Information
2. Issue or feature description
When MIG is enabled, both the MIG resources and the nvidia.com/gpu resource are reported as allocatable, which means that both nvidia.com/gpu and nvidia.com/mig-1g.10gb requests can land on the node; however, the nvidia.com/gpu request fails to inject a GPU.
3. Steps to reproduce the issue
Enable MIG on an A100 GPU.
This may just be a bug in Kubernetes, not in the GPU operator itself.
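The failure mode described in this report can be sketched as a simple mismatch check: the scheduler sees nvidia.com/gpu as allocatable, but no running device-plugin socket serves that resource, so allocation cannot inject a device. The resource names below come from this thread; the check itself is only illustrative:

```shell
# Resources actually served by the running device plugins (from the socket listing).
served="nvidia.com/mig-1g.10gb nvidia.com/mig-2g.20gb"
# Resource requested by the failing pod.
requested="nvidia.com/gpu"

# A pod requesting a resource no plugin serves can still be scheduled if the
# kubelet advertises it, but allocation then fails to inject a GPU.
case " $served " in
  *" $requested "*) echo "served: $requested" ;;
  *)                echo "NOT served: $requested (no plugin socket; GPU injection fails)" ;;
esac
```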