Stability-AI / StableSwarmUI

StableSwarmUI, a modular Stable Diffusion web user interface, with an emphasis on making power tools easily accessible, high performance, and extensibility.
MIT License

NVidia GPU not correctly detected on Kubernetes #388

Closed: derselbst closed this issue 3 weeks ago

derselbst commented 3 weeks ago

Hello,

I deployed StableSwarmUI to a Kubernetes cluster that has MIG-enabled NVIDIA GPUs. However, the GPUs are not recognized during setup. The root cause seems to lie in QueryNvidia(), which executes nvidia-smi:

https://github.com/Stability-AI/StableSwarmUI/blob/c96e6d43e68ea54281cf3286a31b17eab476fa09/src/Utils/NvidiaUtil.cs#L57

When executed in the pod, this returns:

# nvidia-smi --query-gpu=gpu_name,driver_version,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv
name, driver_version, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
NVIDIA A100-SXM4-80GB, 535.161.08, 47, [N/A], [N/A], [Insufficient Permissions], [Insufficient Permissions], [Insufficient Permissions]
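
For reference, that call boils down to shelling out to nvidia-smi and reading its CSV output. The snippet below is a minimal sketch of that pattern using ProcessStartInfo with the same query string; it is not the actual NvidiaUtil.cs code.

using System;
using System.Diagnostics;

// Simplified sketch (not the real NvidiaUtil.cs): run the same nvidia-smi
// query that QueryNvidia() uses and dump the raw CSV, so the problematic
// "[N/A]" / "[Insufficient Permissions]" fields become visible.
var psi = new ProcessStartInfo
{
    FileName = "nvidia-smi",
    Arguments = "--query-gpu=gpu_name,driver_version,temperature.gpu,utilization.gpu,"
              + "utilization.memory,memory.total,memory.free,memory.used --format=csv",
    RedirectStandardOutput = true,
    UseShellExecute = false
};
using Process? proc = Process.Start(psi);
if (proc is null)
{
    Console.Error.WriteLine("nvidia-smi could not be started");
    return;
}
string csv = proc.StandardOutput.ReadToEnd();
proc.WaitForExit();
Console.WriteLine(csv);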

There are two problems:

  1. Querying utilization.* is not supported on MIG-enabled GPUs (cf. nvidia-smi --help-query-gpu), which is why parsing fails (see the detection sketch after this comment).
  2. For some reason, querying memory.* fails due to insufficient permissions (the container runs unprivileged). A plain nvidia-smi command, however, does print the available memory, so I am not sure what is wrong here:
    
    Thu Jun 13 20:19:31 2024
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A100-SXM4-80GB          On  | 00000000:87:00.0 Off |                   On |
    | N/A   52C    P0             188W / 400W |                  N/A |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | MIG devices:                                                                            |
    +------------------+--------------------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
    |      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
    |                  |                                |        ECC|                       |
    |==================+================================+===========+=======================|
    |  0    6   0   0  |              94MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
    |                  |               2MiB / 32767MiB  |           |                       |
    +------------------+--------------------------------+-----------+-----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
    |        ID   ID                                                              Usage      |
    |=========================================================================================|
    |    0   N/A  N/A       615      C   /dlbackend/ComfyUI/venv/bin/python3             0MiB |
    +---------------------------------------------------------------------------------------+



Even though diffusion runs correctly on the GPU, you might want to consider changing that logic a bit.
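
Regarding the first problem: nvidia-smi exposes a mig.mode.current query field (listed in nvidia-smi --help-query-gpu), so the query logic could detect MIG mode up front instead of relying on utilization.* values that are never reported in that configuration. A minimal detection sketch, reusing the ProcessStartInfo pattern above; none of this is existing StableSwarmUI code:

using System;
using System.Diagnostics;

// Sketch: check whether MIG mode is enabled before trusting the per-GPU
// utilization/memory columns. "mig.mode.current" comes from
// `nvidia-smi --help-query-gpu`; everything else here is illustrative.
var psi = new ProcessStartInfo
{
    FileName = "nvidia-smi",
    Arguments = "--query-gpu=mig.mode.current --format=csv,noheader",
    RedirectStandardOutput = true,
    UseShellExecute = false
};
using Process? proc = Process.Start(psi);
string migMode = proc?.StandardOutput.ReadToEnd().Trim() ?? "";
proc?.WaitForExit();

// On the A100 above this prints "Enabled"; non-MIG GPUs report "Disabled",
// and GPUs without MIG support report "[N/A]".
bool migEnabled = migMode.Contains("Enabled", StringComparison.OrdinalIgnoreCase);
Console.WriteLine(migEnabled
    ? "MIG enabled: utilization.* is unsupported and memory.* may be restricted."
    : $"MIG mode reported as: {migMode}");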
mcmonkey4eva commented 3 weeks ago

I pushed a commit which should interpret the refused values as 0 and return successfully rather than erroring out.
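
The general shape of such a fix, sketched below with illustrative names (this is not the actual commit), is to coerce the bracketed placeholders nvidia-smi emits when it cannot report a metric, such as [N/A] or [Insufficient Permissions], to 0 instead of letting the numeric parse fail:

using System;

// Sketch (not the actual commit): coerce values that nvidia-smi refuses to
// report, e.g. "[N/A]" or "[Insufficient Permissions]", to 0 so the whole
// GPU query still succeeds.
static long ParseSmiNumber(string field)
{
    field = field.Trim();
    if (field.StartsWith("["))   // bracketed values mean "unavailable"
        return 0;
    // regular fields look like "81920 MiB", "37 %", or a bare number
    string number = field.Split(' ')[0];
    return long.TryParse(number, out long value) ? value : 0;
}

// Example against the CSV row from the issue: every refused field parses as 0.
string row = "NVIDIA A100-SXM4-80GB, 535.161.08, 47, [N/A], [N/A], " +
             "[Insufficient Permissions], [Insufficient Permissions], [Insufficient Permissions]";
string[] cols = row.Split(", ");
Console.WriteLine($"memory.total = {ParseSmiNumber(cols[5])} MiB"); // prints 0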