NexaAI / nexa-sdk

Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), audio language models, automatic speech recognition (ASR), and text-to-speech (TTS).
https://docs.nexa.ai/
Apache License 2.0
4.47k stars 659 forks

[BUG] FLUX model components fail to utilize available GPU memory (6.8GB/16GB used) with t5xxl/vae/clip falling back to CPU #199

Open WpythonW opened 1 month ago

WpythonW commented 1 month ago

Issue Description

I'm experiencing inefficient GPU utilization with the FLUX model components. While the main FLUX model runs on the GPU (6.8 GB VRAM), the other components (t5xxl-q4_0, ae-fp16, clip_l-fp16) appear to run on the CPU, despite 9.2 GB of GPU memory being free.

Expected behavior: all model components should utilize the available GPU memory for optimal performance.

Actual behavior: only the main FLUX model is offloaded to the GPU; the t5xxl text encoder, VAE, and CLIP components fall back to the CPU.
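The free-memory figure above is just the difference between the card's total VRAM and what the main model occupies; a quick check using the numbers reported in this issue:

```python
# Memory figures reported in this issue (GiB).
# "free" is what the remaining components (t5xxl, vae, clip)
# could still use on the GPU instead of falling back to CPU.
total_vram = 16.0   # Tesla P100
used_vram = 6.8     # main FLUX model
free_vram = total_vram - used_vram
print(f"{free_vram:.1f} GiB of VRAM left unused")
```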

Steps to Reproduce

  1. Install Nexa SDK with CUDA support:

    !CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
  2. Run the following code:

    from nexa.gguf import NexaImageInference

    model_path = "FLUX.1-schnell:q4_0"
    inference = NexaImageInference(
        model_path=model_path,
        wtype="q4_0",
        num_inference_steps=5,
        width=1024,
        height=1024,
        guidance_scale=1.5,
        random_seed=42,
    )
    img = inference.txt2img("A sunset over a mountain range")
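To see which components actually land on the GPU, it helps to watch VRAM usage while the pipeline runs. A minimal sketch (assumes `nvidia-smi` is on PATH; the helper names are mine, not part of Nexa SDK):

```python
import subprocess

def parse_vram(csv_line: str) -> tuple[int, int]:
    """Parse one 'used, total' line of nvidia-smi CSV output (values in MiB)."""
    used, total = (int(x.strip()) for x in csv_line.split(","))
    return used, total

def query_vram(gpu_index: int = 0) -> tuple[int, int]:
    """Query used/total VRAM (MiB) for one GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_vram(out.splitlines()[gpu_index])

if __name__ == "__main__":
    used, total = query_vram()
    print(f"GPU memory: {used} MiB used / {total} MiB total "
          f"({total - used} MiB free)")
```

Running this in a second cell during `txt2img` shows usage plateauing around 6.8 GB, which is consistent with only the main FLUX model being offloaded.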

OS

Linux (Kaggle environment)

Python Version

3.10

Nexa SDK Version

nexaai-0.0.9.0

GPU (if using one)

NVIDIA Tesla P100 (16GB VRAM)

zhiyuan8 commented 4 weeks ago

Thank you for reaching out with your request!

We are actively addressing issues related to inefficient GPU utilization for FLUX model components. We are trying to offload more FLUX components to GPU and will fix this soon.