VadimPoliakov opened 4 days ago
Hi, @VadimPoliakov
I am using an A10 GPU with 48 GB of VRAM on RunPod, which is ample for the Flux model, and it runs smoothly in a Jupyter notebook. But when I deploy it with FastAPI I get a CUDA out of memory error.
The same issue occurs with the quantized model.
Any help would be appreciated.
Thanks!
cc @sayakpaul
Hi. I'm not sure, but it seems like a problem with processing more than one image simultaneously. Try using a queue for that.
No, the problem is that when you stage your deployment, instead of starting the API it gives a CUDA out of memory error.
If you start with several workers, diffusers tries to put all the models into GPU VRAM once per worker. Make a separate service (not FastAPI) with a queue and no extra workers, and have your FastAPI service just call that service.
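A minimal sketch of that setup, assuming the Flux pipeline from diffusers and a single uvicorn worker (the `/generate` endpoint, the in-memory job store, and the model ID are illustrative, not from this thread):

```python
# Single-worker queue sketch: one background thread owns the pipeline, so the
# model is loaded into VRAM exactly once and images are generated one at a time.
import queue
import threading
import uuid

import torch
from diffusers import FluxPipeline
from fastapi import FastAPI

app = FastAPI()
jobs = queue.Queue()
results = {}  # job_id -> PIL image (illustrative; use proper storage in practice)


def worker():
    # Load the model once, inside the single worker thread.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda")
    while True:
        job_id, prompt = jobs.get()
        results[job_id] = pipe(prompt, num_inference_steps=28).images[0]
        jobs.task_done()


threading.Thread(target=worker, daemon=True).start()


@app.post("/generate")
def generate(prompt: str):
    # Enqueue the request instead of running the pipeline in the request handler.
    job_id = str(uuid.uuid4())
    jobs.put((job_id, prompt))
    return {"job_id": job_id}


# Start with a single worker so the model is loaded only once:
#   uvicorn app:app --workers 1
```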
Thanks for the help, bro!
The reason for this issue is the really big models, which total more than 60 GB, so diffusers tries to put all of them into GPU VRAM. There are a couple of ways to fix it.
The first one is to add this line of code to your script:
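The original snippet isn't preserved above; a minimal sketch, assuming the suggestion is diffusers' CPU offloading, which keeps most weights in system RAM and moves them to the GPU only when needed:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
# Assumption: the suggested fix is one of diffusers' offloading helpers.
pipe.enable_sequential_cpu_offload()  # or: pipe.enable_model_cpu_offload()
```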
You will now be able to start your script, but it will be kinda slow.
The second way is to quantize your models. Here are code examples for different ways of using this with different models:
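The original examples aren't preserved here; a minimal sketch, assuming 4-bit quantization of the Flux transformer with bitsandbytes through diffusers:

```python
# Load the Flux transformer in 4-bit NF4 to cut its VRAM footprint,
# then build the pipeline around the quantized transformer.
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")
```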
For these solutions we must say thank you to @sayakpaul.