Qualzz opened this issue 3 months ago
Hi @Qualzz! Thank you for suggesting this.
This idea falls under the topic of model hot-swapping - which we already have in some form with our Dynamic LoRA Adapter Activation feature. However, unloading and then loading a model is much more powerful - and best of all - is probably relatively easy to implement in the APIs. This can be done shortly, I will keep you updated here.
I think that changing the loaded model would be better implemented for your use case as killing the mistral.rs process and then restarting it with a different model. The reason is that supporting this in-process would add a lot of code bloat for what is otherwise a rare situation. Is this not an option?
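As a rough sketch of that restart-based approach, a backend could manage the server process itself. Note that the binary name and CLI flags below are assumptions for illustration and may not match the actual mistralrs-server interface:

```python
# Hypothetical sketch: swap models by restarting the server process.
import subprocess

def build_command(model_id: str, port: int = 1234) -> list[str]:
    """Assemble the server invocation for a given model (flags assumed)."""
    return ["mistralrs-server", "--port", str(port), "plain", "-m", model_id]

class ServerManager:
    """Keeps at most one server process alive; restarting swaps the model."""

    def __init__(self):
        self.proc = None
        self.model_id = None

    def ensure_model(self, model_id: str):
        if self.proc is not None and self.model_id == model_id:
            return self.proc          # already serving the right model
        if self.proc is not None:
            self.proc.terminate()     # free VRAM held by the old model
            self.proc.wait()
        self.proc = subprocess.Popen(build_command(model_id))
        self.model_id = model_id
        return self.proc
```

The backend would call `ensure_model` before each request, so only one model ever occupies VRAM at a time.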
@EricLBuehler
How would that scenario work in practice?
I have a backend that, upon receiving a request, makes a call to MistralRS. If two users make requests within a few seconds of each other, the current implementation lets me keep the mistral.rs process running and it will work (if the requested model is the same), but the model will remain loaded in VRAM indefinitely.
If I terminate the process after each request, it becomes complicated as it might interfere with the second query. Additionally, I don't want to load two instances of Mistral simultaneously.
Maybe I just need to create some kind of proxy that acts as a queue system.
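Such a proxy could be as simple as funneling every request through a single worker, so inference calls never overlap. Here `handler` is a hypothetical stand-in for the real HTTP call to the mistral.rs server:

```python
# Minimal sketch of a serializing proxy: all requests pass through one
# worker thread, so only one model call is in flight at a time.
import queue
import threading

def serve(handler, requests):
    """Process (request_id, payload) pairs one at a time; return results."""
    q = queue.Queue()
    results = {}

    def worker():
        while True:
            item = q.get()
            if item is None:          # sentinel: no more requests
                break
            req_id, payload = item
            results[req_id] = handler(payload)  # one call at a time

    t = threading.Thread(target=worker)
    t.start()
    for req_id, payload in requests:
        q.put((req_id, payload))
    q.put(None)
    t.join()
    return results
```

This guarantees the second user's query simply waits for the first instead of spawning a second model instance.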
@Qualzz my idea is that we can hot-swap the Pipeline (the actual model instance) at runtime, similar to how we do dynamic adapter activation. That is, a request would come in specifying a different model, and we would swap the loaded Pipeline in response.
Would that work for you?
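For illustration, the hot-swap idea could look roughly like this. mistral.rs is Rust, so this Python sketch is only a stand-in; `loader` and the model objects are hypothetical, not the real Pipeline API:

```python
# Sketch of the hot-swap idea: a single slot holds the loaded model, and a
# request naming a different model drops the old one before loading the new.
import threading

class PipelineSlot:
    def __init__(self, loader):
        self._loader = loader          # callable: model_id -> model object
        self._lock = threading.Lock()  # swaps happen one at a time
        self._model_id = None
        self._model = None

    def get(self, model_id):
        with self._lock:
            if self._model_id != model_id:
                self._model = None               # unload first to free VRAM
                self._model = self._loader(model_id)
                self._model_id = model_id
            return self._model
```

Requests for the currently loaded model are served immediately; only a mismatched model id triggers an unload and reload.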
Sorry for not getting back - this is very interesting - I have been working on PagedAttention, and we now outperform llama.cpp on GGUF models :wink: (CUDA), plus you get all the benefits of PagedAttention.
Hello! I had a thought. To avoid keeping the model constantly loaded for tasks that occur infrequently, is there a way to keep the Docker container running with the HTTP server, but only load the model when a query is made and then unload it from memory after a user-defined period? The default could be 5 minutes or indefinitely.
For example, I may have a task that requires a call to the MistralRS HTTP server and then triggers a Stable Diffusion generation. In that case, I need the VRAM to become available without having to shut down and manage the mistral.rs Docker instance.
Another example: I have a Discord bot that performs various tasks, like regular chat with Gemma2 9B (because it's fluent in French). However, the bot may also need to look up information in a large document using RAG. For that, it switches to another model with a larger context window (but less fluent in French) to answer the query, and then switches back to Gemma2 to rephrase the answer with its own persona and bot context. I do a lot of model hot-swapping in a constrained VRAM environment.
One amazing feature would be the ability to set a keep_alive parameter (in seconds), similar to setting the temperature in a query. mistral.rs would use that value to free up the memory after the specified amount of time.
In my case, for large models I would set it to 0, meaning immediate unload once the query is done, while for smaller models I would use a longer period to avoid reloading during a chat session.
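The keep_alive semantics described above could be sketched like this. All names here are illustrative, not an existing mistral.rs API; the point is only the timer-reset behavior:

```python
# Sketch of a keep_alive policy: after each request, schedule an unload
# keep_alive seconds in the future; a new request cancels the pending
# unload, and keep_alive=0 unloads immediately.
import threading

class KeepAlive:
    def __init__(self, unload):
        self._unload = unload          # callable that frees the model's VRAM
        self._timer = None

    def touch(self, keep_alive: float):
        """Call after each completed request with the query's keep_alive."""
        if self._timer is not None:
            self._timer.cancel()       # a new request resets the clock
            self._timer = None
        if keep_alive == 0:
            self._unload()             # large model: free VRAM right away
        else:
            self._timer = threading.Timer(keep_alive, self._unload)
            self._timer.daemon = True
            self._timer.start()
```

With keep_alive=0 the model is dropped as soon as the query finishes; with a longer value, a chat session keeps resetting the timer and the model stays resident.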
I don't have much experience with the potential complications or unintended side effects this feature might bring, but I believe it could be highly beneficial.