Open KyrieCui opened 1 year ago
Currently, we have yet to support unloading LoRA layers. From what I have tested so far, unloading models from memory is pretty slow when around 10-15 layers are loaded.
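For reference, a minimal sketch of what unloading looks like when talking to PEFT directly (untested here; the model and adapter IDs are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model and attach one LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "my-org/my-lora-adapter", adapter_name="default")

# unload() strips the LoRA layers and returns the plain base model;
# with many adapter layers attached, this is the step where the slowness shows up.
base_again = model.unload()
```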
Another approach would be not to load LoRA layers when loading the model into memory, and instead load them dynamically on request. But imagine a distributed environment: there is no way to ensure that all model pods will load the adapter correctly.
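As a rough illustration of the per-request idea (not something the framework does today; the handler, the cache set, and the adapter paths here are made up):

```python
# Adapters already pulled into the in-memory model, keyed by name.
loaded_adapters = set()

def generate_with_adapter(model, tokenizer, adapter_name, adapter_path, prompt):
    # Lazily load the adapter the first time a request asks for it.
    if adapter_name not in loaded_adapters:
        model.load_adapter(adapter_path, adapter_name=adapter_name)
        loaded_adapters.add(adapter_name)
    # Route this request through the chosen adapter.
    model.set_adapter(adapter_name)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    return model.generate(**inputs, max_new_tokens=64)
```

In a multi-pod deployment, each replica would have to run this lazy load independently, which is exactly the consistency problem mentioned above.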
I think for multi-adapter setups, the ability to use the base model can be supported, but it is probably very low priority right now.
I ran into a question when using multi-adapter. Loading different PEFT adapters and calling them by adapter_name / adapter_id works. However, can I call the vanilla LLM? For example, if I deploy Llama 2 with multiple adapters, can I disable the adapters and run inference with the original Llama 2 model through the framework? Looking forward to your reply.
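For what it's worth, when using PEFT directly this is possible with the disable_adapter() context manager; whether the serving framework exposes it is a separate question. A minimal sketch (model and adapter IDs are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach two adapters to the same base model.
model = PeftModel.from_pretrained(base, "my-org/adapter-a", adapter_name="a")
model.load_adapter("my-org/adapter-b", adapter_name="b")

inputs = tokenizer("Hello", return_tensors="pt")

# Inference through adapter "a".
model.set_adapter("a")
out_adapter = model.generate(**inputs, max_new_tokens=32)

# Inference through the vanilla Llama 2 weights: all adapters are bypassed inside the context.
with model.disable_adapter():
    out_vanilla = model.generate(**inputs, max_new_tokens=32)
```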