EricLBuehler / mistral.rs

Blazingly fast LLM inference.

[FEATURE REQUEST] Dynamic Model Loading and Unloading for Efficient VRAM Management #545

Open · Qualzz opened this issue 3 months ago

Qualzz commented 3 months ago

Hello! I had a thought. To avoid keeping a model loaded constantly for tasks that occur infrequently, is there a way to keep the Docker container and HTTP server running, but only load the model when a query is made and then unload it from memory after a user-defined period? The default could be 5 minutes, or keeping it loaded indefinitely.

For example, I may have a task that requires a call to the MistralRS HTTP server and then triggers a Stable Diffusion generation. In that case, I need the VRAM to be freed without shutting down and restarting the mistral.rs Docker instance.

Another example: I have a Discord bot that performs various tasks, like regular chat with Gemma2 9B (because it's fluent in French). However, the bot may also need to look up information in a large document using RAG. For that, it switches to another model with a larger context window (but less fluent in French) to answer the query, and then switches back to Gemma2 9B to rephrase the answer with its own persona and bot context. I do a lot of model hot-swapping in a constrained VRAM environment.

One amazing feature would be the ability to set a keep_alive parameter (in seconds), similar to setting the temperature in a query. mistral.rs would use that value to free the memory after the specified amount of time.

In my case, for large models I would set it to 0, meaning the model is unloaded immediately once the query is done, while for smaller models I would use a longer period to avoid reloading them during a chat session.
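
To make this concrete, a request could look roughly like the sketch below (a Rust client is used just for illustration). The keep_alive field is purely hypothetical - it is the proposal itself - and the endpoint path, port, and model id are placeholders:

```rust
use serde_json::json;

fn main() -> Result<(), reqwest::Error> {
    // Ordinary OpenAI-style chat completion body, plus the proposed field.
    let body = json!({
        "model": "gemma-2-9b",                              // placeholder model id
        "messages": [{ "role": "user", "content": "Bonjour !" }],
        "temperature": 0.7,
        "keep_alive": 300   // hypothetical: unload the weights 5 minutes after the last request
    });

    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:1234/v1/chat/completions")  // host/port assumed
        .json(&body)
        .send()?;
    println!("{}", resp.text()?);
    Ok(())
}
```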

I don't have much experience with the potential complications or unintended side effects this feature might bring, but I believe it could be highly beneficial.

EricLBuehler commented 3 months ago

Hi @Qualzz! Thank you for suggesting this.

This idea falls under the topic of model hot-swapping, which we already have in some form with our Dynamic LoRA Adapter Activation feature. However, unloading and then loading a model is much more powerful, and best of all, it is probably relatively easy to implement in the APIs. This can be done shortly; I will keep you updated here.

I think that changing the loaded model would, for your use case, be better implemented as killing the mistral.rs process and then restarting it with a different model. The reason is that otherwise there would be a lot of code bloat for what is a relatively rare situation. Is this not an option?
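
For illustration, that restart approach could live in a small supervisor on your side, something like the sketch below. This is not part of mistral.rs, and the binary name and CLI flags here are assumptions - check the server's --help for the real invocation:

```rust
use std::process::{Child, Command};

/// Keeps at most one mistralrs-server child process alive and replaces it
/// whenever a different model is needed.
struct ServerHandle {
    child: Option<Child>,
    current_model: Option<String>,
}

impl ServerHandle {
    fn new() -> Self {
        Self { child: None, current_model: None }
    }

    fn ensure_model(&mut self, model_id: &str) -> std::io::Result<()> {
        if self.current_model.as_deref() == Some(model_id) {
            return Ok(()); // the right model is already being served
        }
        if let Some(mut old) = self.child.take() {
            old.kill()?; // frees the VRAM held by the previous model
            old.wait()?;
        }
        let child = Command::new("mistralrs-server") // binary name and flags are assumed
            .args(["--port", "1234", "plain", "-m", model_id])
            .spawn()?;
        self.child = Some(child);
        self.current_model = Some(model_id.to_string());
        // In practice, poll the HTTP port here until the server is ready to accept requests.
        Ok(())
    }
}
```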

Qualzz commented 2 months ago

@EricLBuehler

How would that scenario work in practice?

I have a backend that, upon receiving a request, makes a call to MistralRS. If I have two users making requests within a few seconds of each other, the current implementation works: I can keep the mistral.rs process running and it will serve both (if the requested model is the same), but the model then remains loaded in VRAM indefinitely.

If I terminate the process after each request, it becomes complicated, as it might interfere with the second query. Additionally, I don't want to load two instances of mistral.rs simultaneously.

Maybe I just need to create some kind of proxy that acts as a queue system.
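
If it helps, that proxy could be as simple as putting every request behind an async lock so two users can never race a model swap. A rough sketch, where swap_to and forward_request are placeholders for the restart logic and the actual HTTP call to mistral.rs:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

#[derive(Default)]
struct Proxy {
    current_model: Option<String>,
}

impl Proxy {
    async fn swap_to(&mut self, model_id: &str) {
        // Placeholder: kill/restart mistralrs-server here (see the earlier sketch).
        self.current_model = Some(model_id.to_string());
    }

    async fn forward_request(&self, prompt: &str) -> String {
        // Placeholder: POST to the local mistral.rs HTTP server.
        format!("response to: {prompt}")
    }
}

async fn handle(proxy: Arc<Mutex<Proxy>>, model_id: &str, prompt: &str) -> String {
    // Requests queue up here; a second user waits instead of racing the first.
    let mut guard = proxy.lock().await;
    if guard.current_model.as_deref() != Some(model_id) {
        guard.swap_to(model_id).await;
    }
    guard.forward_request(prompt).await
}

#[tokio::main]
async fn main() {
    let proxy = Arc::new(Mutex::new(Proxy::default()));
    // Two requests for different models: the second waits until the first finishes.
    let (a, b) = tokio::join!(
        handle(proxy.clone(), "gemma-2-9b", "Bonjour !"),
        handle(proxy.clone(), "long-context-model", "Summarize the attached document"),
    );
    println!("{a}\n{b}");
}
```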

EricLBuehler commented 2 months ago

@Qualzz my idea is that we can hot-swap the Pipeline (the actual model instance) at runtime, similar to how we do dynamic adapter activation. That is, a request would come in asking the server to swap to a different model, and the Pipeline would be replaced in place.
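
Roughly, the shape would be something like the sketch below - not the actual mistral.rs internals, just an illustration of keeping the loaded model behind a shared lock so a control request can replace it without restarting the server:

```rust
use std::sync::{Arc, RwLock};

// Stand-in for the real Pipeline trait; `generate` is illustrative only.
trait Pipeline: Send + Sync {
    fn generate(&self, prompt: &str) -> String;
}

struct Server {
    pipeline: Arc<RwLock<Box<dyn Pipeline>>>,
}

impl Server {
    // Normal inference requests take a read lock on the current model.
    fn infer(&self, prompt: &str) -> String {
        self.pipeline.read().unwrap().generate(prompt)
    }

    // A "swap model" request takes the write lock: the old weights are dropped
    // (freeing VRAM) and the new Pipeline takes their place, while in-flight
    // requests simply wait on the lock. The HTTP server itself never restarts.
    fn swap_model(&self, new_pipeline: Box<dyn Pipeline>) {
        *self.pipeline.write().unwrap() = new_pipeline;
    }
}
```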

Would that work for you?

Sorry for not getting back sooner - this is very interesting. I have been working on PagedAttention, and we are now outperforming llama.cpp for GGUF inference on CUDA :wink:, plus you get all the benefits of PagedAttention.