VectorSpaceLab / OmniGen

OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
MIT License

Slow when using CUDA on 2nd GPU with Pinokio #98

Open TypQxQ opened 3 weeks ago

TypQxQ commented 3 weeks ago

I tried running through the manual installer in a conda env, but I get all kinds of errors with CUDA and numpy. I'm just not that experienced with this 😞 .

Running with Pinokio at least starts the app, but it runs very slowly. This is generating the first demo prompt: [screenshot] I'm not sure it is even using CUDA, although it does use VRAM, because the 3D engine was at 0% the whole time.

Tried the second demo with 15 inference steps and got these times:

Microsoft Windows [Version 10.0.26120.2200]
(c) Microsoft Corporation. All rights reserved.

D:\pinokio\api\omnigen.git\app>conda_hook && conda deactivate && conda deactivate && conda deactivate && conda activate base && D:\pinokio\api\omnigen.git\app\env\Scripts\activate D:\pinokio\api\omnigen.git\app\env && python app.py

Fetching 10 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<?, ?it/s]
Loading safetensors
* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

D:\pinokio\api\omnigen.git\app\env\lib\site-packages\diffusers\models\attention_processor.py:2358: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = F.scaled_dot_product_attention(

  0%|                                                                                                                                                                                                                             | 0/15 [00:00<?, ?it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [3:19:56<00:00, 799.79s/it]

I know my system is slow compared to others but should I give up?
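For what it's worth, a minimal way to check whether PyTorch in the active environment actually sees a CUDA device is a snippet like this (a diagnostic sketch, assuming `torch` is importable in the Pinokio env; run it with that env's `python`):

```python
# Minimal diagnostic: does PyTorch see a CUDA device, and how many?
# Assumes torch is installed in the active environment; if it isn't,
# we report CUDA as unavailable instead of crashing.
try:
    import torch
    cuda_ok = torch.cuda.is_available()
    device_count = torch.cuda.device_count() if cuda_ok else 0
except ImportError:
    cuda_ok, device_count = False, 0

print("CUDA available:", cuda_ok)
print("GPU count:", device_count)
```

If this prints `CUDA available: False`, the app is falling back to CPU, which would explain hours-per-image speeds.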

TypQxQ commented 3 weeks ago

It also can't allocate memory on subsequent runs, even after restarting the app. After getting the error it still uses the memory until I stop the app, but then on starting again it can't allocate it anyway. [screenshot]

TypQxQ commented 3 weeks ago

Running the same demo prompt with CUDA on GPU0 shows the 3D engine running, but the fans aren't spinning up. [screenshot] GPU0 also has some memory used by Windows, and I was hoping that using the second GPU would free some memory so I could run the second demo. Instead I get an out-of-memory error on the second demo:

  File "D:\pinokio\api\omnigen.git\app\env\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\pinokio\api\omnigen.git\app\env\lib\site-packages\transformers\models\phi3\modeling_phi3.py", line 713, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.19 GiB. GPU 
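For reference, the usual way to pin a process to the second GPU (a sketch; I'm not sure whether Pinokio's launcher passes environment variables through to `app.py`) is to set `CUDA_VISIBLE_DEVICES` before any CUDA library loads:

```python
import os

# Expose only the second physical GPU (index 1) to CUDA libraries.
# This must be set before torch (or any CUDA runtime) is imported;
# the chosen GPU then appears to the process as device 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
```

The same effect can be had in the shell with `set CUDA_VISIBLE_DEVICES=1` before `python app.py`, which also frees the app from touching GPU0's Windows-reserved memory.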
staoxiao commented 3 weeks ago

The GPU speed should not be this slow. Please confirm whether it is actually running on the GPU, and check that you are using the latest code.

If GPU memory is insufficient, you can enable `offload_model`. For memory requirements, please refer to https://github.com/VectorSpaceLab/OmniGen/blob/main/docs/inference.md#requiremented-resources — 10GB of memory is enough for most cases.

If you encounter issues during installation, you can refer to https://github.com/VectorSpaceLab/OmniGen/issues/73 for some solutions.
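A hedged sketch of what enabling `offload_model` can look like — the model name and call parameters here are assumptions taken from the repo's docs, so check the README for the exact API. The import guard is only so the snippet degrades gracefully when OmniGen isn't installed:

```python
try:
    from OmniGen import OmniGenPipeline  # requires the OmniGen package
except ImportError:
    OmniGenPipeline = None


def generate(prompt: str):
    """Generate an image with model offloading enabled to reduce VRAM use."""
    if OmniGenPipeline is None:
        return None  # OmniGen not installed in this environment
    pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")
    # offload_model=True keeps weights on the CPU and moves them to the
    # GPU only as needed: slower per step, but far less GPU memory.
    return pipe(prompt=prompt, num_inference_steps=15, offload_model=True)


# Usage (downloads the model on first run):
# images = generate("a photo of a cat on a desk")
```

Offloading trades speed for memory, so it helps with the `CUDA out of memory` error above but won't fix slow per-iteration times.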

TypQxQ commented 2 weeks ago

I deactivated my CPU graphics and tried again, and it was worse: 491.58s/it. I installed it in a conda env and ran the extra commands from #73 and got it working, but worse than in Pinokio. I don't know how much of it is running on the GPU: Windows shows the 3D engine at 100% on the GPU, but the card is running cold. Copy is at 0% usage and I don't know if that's a problem, same for Video decode and encode, but I guess those aren't used.