bentoml / OpenLLM

Run any open-source LLMs, such as Llama and Gemma, as OpenAI-compatible API endpoints in the cloud.
https://bentoml.com
Apache License 2.0

Question about multi-adapter #656

Open KyrieCui opened 1 year ago

KyrieCui commented 1 year ago

I ran into a question when using multi-adapter. Loading different PEFT adapters and calling them by adapter_name / adapter_id works. However, can I also call the vanilla LLM? For example, if I deploy Llama 2 with multiple adapters, can I disable the adapters and run inference with the original Llama 2 model through the framework? Looking forward to your reply.
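
For reference, this is roughly what the question is asking for at the raw PEFT level: several LoRA adapters loaded onto one base model, with the option to fall back to the unmodified base weights for a given call. The adapter repos and names below are hypothetical; this is a sketch of the underlying `peft` API, not of OpenLLM's multi-adapter interface.

```python
# Sketch at the PEFT level (adapter repos/names are hypothetical).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)

# Load two hypothetical LoRA adapters on top of the same base weights.
model = PeftModel.from_pretrained(base, "my-org/llama2-lora-summarize", adapter_name="summarize")
model.load_adapter("my-org/llama2-lora-sql", adapter_name="sql")

inputs = tokenizer("Total revenue per region:", return_tensors="pt").to(model.device)

# Route a call to a specific adapter by name.
model.set_adapter("sql")
with torch.no_grad():
    adapter_out = model.generate(**inputs, max_new_tokens=64)

# Fall back to the vanilla base model: disable_adapter() temporarily
# bypasses the LoRA layers without unloading them from memory.
with model.disable_adapter(), torch.no_grad():
    base_out = model.generate(**inputs, max_new_tokens=64)
```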

aarnphm commented 1 year ago

Currently, we don't yet support unloading LoRA layers. From what I have tested so far, unloading models from memory is pretty slow when around 10-15 adapter layers are loaded.

Another approach is to not disable the LoRA layers when the model is loaded into memory, and instead load adapters dynamically per request. But in a distributed environment, there is no way to ensure that all model pods will load the adapter correctly.
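
A rough sketch of the "load dynamically on request" idea described above, again at the PEFT level (the adapter registry, handler, and repo names are hypothetical, not OpenLLM code). It also shows where the distributed problem comes from: each pod would have to perform the lazy load itself, so nothing guarantees every replica ends up with the same set of adapters.

```python
# Hypothetical per-request adapter routing (not OpenLLM's implementation).
# Assumes `model` is a PeftModel that already has at least one adapter loaded.

ADAPTER_REPOS = {
    "summarize": "my-org/llama2-lora-summarize",
    "sql": "my-org/llama2-lora-sql",
}
_loaded: set[str] = set()

def generate(model, tokenizer, prompt: str, adapter_id: str | None = None, **gen_kwargs):
    """Serve one request, optionally through a named LoRA adapter."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if adapter_id is None:
        # Vanilla base model: bypass all LoRA layers for this call only.
        with model.disable_adapter():
            return model.generate(**inputs, **gen_kwargs)
    if adapter_id not in _loaded:
        # Lazy-load on first use. In a multi-pod deployment each replica
        # does this independently, which is why consistency is hard to guarantee.
        model.load_adapter(ADAPTER_REPOS[adapter_id], adapter_name=adapter_id)
        _loaded.add(adapter_id)
    model.set_adapter(adapter_id)
    return model.generate(**inputs, **gen_kwargs)
```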

I think the ability to use the base model alongside multiple adapters can be supported, but it is probably very low priority right now.