Open alecyan1993 opened 9 months ago
Hi, we are currently using stable-fast in our AI generation service. However, we have many SDXL models on the server, and we run into CUDA OOM issues after switching the compiled models between different tasks a few times. Before adding stable-fast, our approach was to run inference with one model and then move it to the CPU after the run. After adopting stable-fast, this cleanup step leaves some extra memory allocated on the GPU. Can you explain the reason and how to solve this issue? Big thanks!
I suggest switching model weights instead of recompiling if you need to switch models very frequently. Alternatively, you could use a multi-process design so that no leftover resources remain occupied after cleanup. The remaining memory is likely caused by the tracing or freezing process of TorchScript.
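Here is a minimal sketch of the multi-process idea: run each generation task in a short-lived worker process, so all CUDA allocations (including anything held by the compiled graph) are released when the process exits. `build_compiled_pipeline` and `run_task` are hypothetical helpers standing in for your own loading/compilation and inference code.

```python
import torch.multiprocessing as mp


def worker(model_path, prompt, result_queue):
    # Hypothetical helpers: load the SDXL pipeline and compile it with stable-fast.
    pipe = build_compiled_pipeline(model_path)
    image = run_task(pipe, prompt)
    result_queue.put(image)
    # When the process exits, its CUDA context and all cached memory are freed.


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # spawn is required when using CUDA in subprocesses
    queue = ctx.Queue()
    p = ctx.Process(target=worker, args=("/models/sdxl-a", "a photo of a cat", queue))
    p.start()
    image = queue.get()
    p.join()
```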
Thanks so much for your quick reply! Do you have any examples of switching model weights instead of recompiling? Yes, we do need to switch models very frequently on our platform.
You can check the LoRA switching example in the doc to see how to switch weights.
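For reference, a minimal sketch of the in-place weight swap that the LoRA switching example relies on, assuming an already compiled pipeline `compiled_pipe` and a `new_state_dict` for a checkpoint with the same architecture. The compiled graph keeps references to the existing parameter tensors, so copying new values into them (rather than replacing the tensor objects) keeps the trace valid.

```python
import torch


@torch.no_grad()
def switch_weights(module, new_state_dict):
    # state_dict() returns references to the live tensors used by the traced graph.
    old_state_dict = module.state_dict()
    for name, new_value in new_state_dict.items():
        # In-place copy so the traced forward function sees the new weights.
        old_state_dict[name].copy_(
            new_value.to(old_state_dict[name].device, old_state_dict[name].dtype)
        )


# Example: swap the UNet weights of a compiled SDXL pipeline to another checkpoint.
switch_weights(compiled_pipe.unet, new_state_dict)
```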
Sure! Thanks again!