bentoml / OpenLLM

Run any open-source LLMs, such as Llama and Gemma, as OpenAI-compatible API endpoints in the cloud.
https://bentoml.com
Apache License 2.0

Question about multi-adapter #656

Open KyrieCui opened 1 year ago

KyrieCui commented 1 year ago

I ran into a question when using multi-adapter. Loading different PEFT adapters and calling them by adapter_name / adapter_id works. However, can I also call the vanilla LLM? For example, if I deploy Llama 2 with multiple adapters, can I disable the adapters and run inference with the original Llama 2 model through the framework? Looking forward to your reply.
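
For reference, this is roughly what the question is asking for at the raw PEFT level: several LoRA adapters loaded onto one base model, with the option to fall back to the unmodified base weights for a given call. The adapter repos and names below are hypothetical; this is a sketch of the underlying `peft` API, not of OpenLLM's multi-adapter interface.

```python
# Sketch at the PEFT level (adapter repos/names are hypothetical).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)

# Load two hypothetical LoRA adapters on top of the same base weights.
model = PeftModel.from_pretrained(base, "my-org/llama2-lora-summarize", adapter_name="summarize")
model.load_adapter("my-org/llama2-lora-sql", adapter_name="sql")

inputs = tokenizer("Total revenue per region:", return_tensors="pt").to(model.device)

# Route a call to a specific adapter by name.
model.set_adapter("sql")
with torch.no_grad():
    adapter_out = model.generate(**inputs, max_new_tokens=64)

# Fall back to the vanilla base model: disable_adapter() temporarily
# bypasses the LoRA layers without unloading them from memory.
with model.disable_adapter(), torch.no_grad():
    base_out = model.generate(**inputs, max_new_tokens=64)
```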

aarnphm commented 1 year ago

Currently, we don't yet support unloading LoRA layers. From what I have tested so far, unloading models from memory is pretty slow when around 10-15 adapter layers are loaded.

Another approach is to not disable the LoRA layers when the model is loaded into memory, and instead load adapters dynamically per request. But in a distributed environment, there is no way to ensure that all model pods will load the adapter correctly.
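
A rough sketch of the "load dynamically on request" idea described above, again at the PEFT level (the adapter registry, handler, and repo names are hypothetical, not OpenLLM code). It also shows where the distributed problem comes from: each pod would have to perform the lazy load itself, so nothing guarantees every replica ends up with the same set of adapters.

```python
# Hypothetical per-request adapter routing (not OpenLLM's implementation).
# Assumes `model` is a PeftModel that already has at least one adapter loaded.

ADAPTER_REPOS = {
    "summarize": "my-org/llama2-lora-summarize",
    "sql": "my-org/llama2-lora-sql",
}
_loaded: set[str] = set()

def generate(model, tokenizer, prompt: str, adapter_id: str | None = None, **gen_kwargs):
    """Serve one request, optionally through a named LoRA adapter."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if adapter_id is None:
        # Vanilla base model: bypass all LoRA layers for this call only.
        with model.disable_adapter():
            return model.generate(**inputs, **gen_kwargs)
    if adapter_id not in _loaded:
        # Lazy-load on first use. In a multi-pod deployment each replica
        # does this independently, which is why consistency is hard to guarantee.
        model.load_adapter(ADAPTER_REPOS[adapter_id], adapter_name=adapter_id)
        _loaded.add(adapter_id)
    model.set_adapter(adapter_id)
    return model.generate(**inputs, **gen_kwargs)
```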

I think the ability to use the base model alongside multiple adapters can be supported, but it is probably very low priority right now.