Stability-AI / StableSwarmUI

StableSwarmUI, a modular Stable Diffusion web user interface, with an emphasis on making power tools easily accessible, high performance, and extensibility.
MIT License

NVidia GPU not correctly detected on Kubernetes #388

Closed: derselbst closed this issue 3 weeks ago

derselbst commented 3 weeks ago

Hello,

I deployed StableSwarmUI to a Kubernetes cluster that has MIG-enabled NVIDIA GPUs. However, the GPUs are not recognized during setup. The root cause seems to lie in QueryNvidia(), which executes nvidia-smi:

https://github.com/Stability-AI/StableSwarmUI/blob/c96e6d43e68ea54281cf3286a31b17eab476fa09/src/Utils/NvidiaUtil.cs#L57

When executed in the pod, this returns:

# nvidia-smi --query-gpu=gpu_name,driver_version,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv
name, driver_version, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
NVIDIA A100-SXM4-80GB, 535.161.08, 47, [N/A], [N/A], [Insufficient Permissions], [Insufficient Permissions], [Insufficient Permissions]
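
For reference, that call boils down to shelling out to nvidia-smi and reading its CSV output. The snippet below is a minimal sketch of that pattern using ProcessStartInfo with the same query string; it is not the actual NvidiaUtil.cs code.

using System;
using System.Diagnostics;

// Simplified sketch (not the real NvidiaUtil.cs): run the same nvidia-smi
// query that QueryNvidia() uses and dump the raw CSV, so the problematic
// "[N/A]" / "[Insufficient Permissions]" fields become visible.
var psi = new ProcessStartInfo
{
    FileName = "nvidia-smi",
    Arguments = "--query-gpu=gpu_name,driver_version,temperature.gpu,utilization.gpu,"
              + "utilization.memory,memory.total,memory.free,memory.used --format=csv",
    RedirectStandardOutput = true,
    UseShellExecute = false
};
using Process? proc = Process.Start(psi);
if (proc is null)
{
    Console.Error.WriteLine("nvidia-smi could not be started");
    return;
}
string csv = proc.StandardOutput.ReadToEnd();
proc.WaitForExit();
Console.WriteLine(csv);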

There are two problems:

  1. Querying utilization.* is not supported on MIG-enabled GPUs (cf. nvidia-smi --help-query-gpu), which is why parsing fails (see the detection sketch after this comment).
  2. For some reason, querying memory.* fails due to insufficient permissions (the container runs unprivileged). A plain nvidia-smi command, however, does print the available memory, so I am not sure what is wrong here:
    
    Thu Jun 13 20:19:31 2024
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A100-SXM4-80GB          On  | 00000000:87:00.0 Off |                   On |
    | N/A   52C    P0             188W / 400W |                  N/A |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | MIG devices:                                                                            |
    +------------------+--------------------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
    |      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
    |                  |                                |        ECC|                       |
    |==================+================================+===========+=======================|
    |  0    6   0   0  |              94MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
    |                  |               2MiB / 32767MiB  |           |                       |
    +------------------+--------------------------------+-----------+-----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
    |        ID   ID                                                              Usage      |
    |=========================================================================================|
    |    0   N/A  N/A       615      C   /dlbackend/ComfyUI/venv/bin/python3             0MiB |
    +---------------------------------------------------------------------------------------+



Even though diffusion runs correctly on the GPU, you might want to consider changing that logic a bit.
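
Regarding the first problem: nvidia-smi exposes a mig.mode.current query field (listed in nvidia-smi --help-query-gpu), so the query logic could detect MIG mode up front instead of relying on utilization.* values that are never reported in that configuration. A minimal detection sketch, reusing the ProcessStartInfo pattern above; none of this is existing StableSwarmUI code:

using System;
using System.Diagnostics;

// Sketch: check whether MIG mode is enabled before trusting the per-GPU
// utilization/memory columns. "mig.mode.current" comes from
// `nvidia-smi --help-query-gpu`; everything else here is illustrative.
var psi = new ProcessStartInfo
{
    FileName = "nvidia-smi",
    Arguments = "--query-gpu=mig.mode.current --format=csv,noheader",
    RedirectStandardOutput = true,
    UseShellExecute = false
};
using Process? proc = Process.Start(psi);
string migMode = proc?.StandardOutput.ReadToEnd().Trim() ?? "";
proc?.WaitForExit();

// On the A100 above this prints "Enabled"; non-MIG GPUs report "Disabled",
// and GPUs without MIG support report "[N/A]".
bool migEnabled = migMode.Contains("Enabled", StringComparison.OrdinalIgnoreCase);
Console.WriteLine(migEnabled
    ? "MIG enabled: utilization.* is unsupported and memory.* may be restricted."
    : $"MIG mode reported as: {migMode}");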
mcmonkey4eva commented 3 weeks ago

I pushed a commit which should interpret the refused values as 0 and return successfully rather than erroring out.
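
The general shape of such a fix, sketched below with illustrative names (this is not the actual commit), is to coerce the bracketed placeholders nvidia-smi emits when it cannot report a metric, such as [N/A] or [Insufficient Permissions], to 0 instead of letting the numeric parse fail:

using System;

// Sketch (not the actual commit): coerce values that nvidia-smi refuses to
// report, e.g. "[N/A]" or "[Insufficient Permissions]", to 0 so the whole
// GPU query still succeeds.
static long ParseSmiNumber(string field)
{
    field = field.Trim();
    if (field.StartsWith("["))   // bracketed values mean "unavailable"
        return 0;
    // regular fields look like "81920 MiB", "37 %", or a bare number
    string number = field.Split(' ')[0];
    return long.TryParse(number, out long value) ? value : 0;
}

// Example against the CSV row from the issue: every refused field parses as 0.
string row = "NVIDIA A100-SXM4-80GB, 535.161.08, 47, [N/A], [N/A], " +
             "[Insufficient Permissions], [Insufficient Permissions], [Insufficient Permissions]";
string[] cols = row.Split(", ");
Console.WriteLine($"memory.total = {ParseSmiNumber(cols[5])} MiB"); // prints 0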