StableSwarmUI, A Modular Stable Diffusion Web-User-Interface, with an emphasis on making powertools easily accessible, high performance, and extensibility.
MIT License
NVidia GPU not correctly detected on Kubernetes #388
I deployed StableSwarmUI to a Kubernetes cluster with MIG-enabled Nvidia GPUs, but the GPUs are not recognized during setup. The root cause seems to lie in QueryNvidia(), which executes nvidia-smi and parses its output. There are two problems:
- Querying utilization.* is not supported on MIG-enabled GPUs (cf. nvidia-smi --help-query-gpu), which is why parsing fails.
- Querying memory.* fails due to lacking permissions (the container runs unprivileged). A plain nvidia-smi call, however, does print the available memory, so I am not sure what is wrong here:
Thu Jun 13 20:19:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:87:00.0 Off |                   On |
| N/A   52C    P0             188W / 400W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    6   0   0  |              94MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       615      C   /dlbackend/ComfyUI/venv/bin/python3           0MiB |
+---------------------------------------------------------------------------------------+
Though diffusions run correctly on the GPU, you might want to consider changing that logic a bit.
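To illustrate what a more tolerant version of that logic could look like, here is a minimal sketch. It is written in Python only to keep it short and is not the project's actual code. It assumes the query uses nvidia-smi's CSV output (--format=csv,noheader,nounits) and that unavailable fields come back as bracketed placeholders such as [N/A] or [Insufficient Permissions]. The field list, the helper names and the sample lines are hypothetical and were not captured from the pod.

```python
from typing import Dict, List, Optional

# Hypothetical field list for illustration only; QueryNvidia() may request
# different fields.
FIELDS: List[str] = ["name", "utilization.gpu", "memory.used", "memory.total"]


def parse_value(raw: str) -> Optional[str]:
    """Return the trimmed value, or None for an empty or bracketed placeholder.

    Assumption: fields that are unsupported or not permitted are reported as
    bracketed text, e.g. "[N/A]" or "[Insufficient Permissions]".
    """
    value = raw.strip()
    if not value or (value.startswith("[") and value.endswith("]")):
        return None
    return value


def parse_gpu_line(line: str, fields: List[str] = FIELDS) -> Dict[str, Optional[str]]:
    """Split one CSV line into a field -> value mapping, tolerating gaps."""
    parts = line.split(",")  # assumes the GPU name itself contains no comma
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} columns, got {len(parts)}")
    return {field: parse_value(part) for field, part in zip(fields, parts)}


def to_int(value: Optional[str]) -> Optional[int]:
    """Convert a numeric field to int; missing or malformed values become None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


if __name__ == "__main__":
    # Made-up sample line shaped like what a MIG-enabled GPU might report.
    sample = "NVIDIA A100-SXM4-80GB, [N/A], [Insufficient Permissions], [Insufficient Permissions]"
    row = parse_gpu_line(sample)
    print(row["name"], to_int(row["utilization.gpu"]), to_int(row["memory.total"]))
```

With this approach the GPU is still detected by name, and the utilization and memory values are simply treated as unknown instead of aborting detection.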
Code reference for QueryNvidia(): https://github.com/Stability-AI/StableSwarmUI/blob/c96e6d43e68ea54281cf3286a31b17eab476fa09/src/Utils/NvidiaUtil.cs#L57