richdougherty opened 3 weeks ago
Still working on this. Just added support to the OpenAI-compatible server for hot-swapping LoRAs via model aliases. This allows fast serving of different LoRA adapters that extend the same base model with minimal switching overhead.
{
  "host": "0.0.0.0",
  "port": 8080,
  "models": [
    {
      "model_alias": "mistral",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "verbose": true
    },
    {
      "model_alias": "mistral-magicoder",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "lora_adapters": {
        "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0
      },
      "verbose": true
    },
    {
      "model_alias": "mistral-conllpp",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "lora_adapters": {
        "./conllpp-lora-mistral-7b-v0.1.gguf": 1.0
      },
      "verbose": true
    }
  ]
}
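This config can then be served with the existing multi-model config file mechanism, e.g. python3 -m llama_cpp.server --config_file config.json (assuming the file above is saved as config.json).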
Then calling the OpenAI-compatible API with "model": "mistral", "model": "mistral-magicoder", or "model": "mistral-conllpp" will result in a hot-swap, e.g.
Hot-swapping model, setting existing LoRA adapter scales to 0.0.
Hot-swapping model, setting LoRA adapter scales for mistral-conllpp.
llama_lora_adapter_init_internal: loading lora adapter from './conllpp-lora-mistral-7b-v0.1.gguf' ...
llama_lora_adapter_init_internal: CPU_Mapped LoRA buffer size = 13.00 MiB
llama_lora_adapter_init_internal: loaded 128 tensors from lora file
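As a client-side illustration, here is a minimal sketch using the official openai Python package against this server. The base URL, placeholder API key, and prompt are assumptions for the example; each request just picks one of the aliases from the config above:

from openai import OpenAI

# Point the standard OpenAI client at the local llama-cpp-python server.
# The API key is a placeholder; the server may not require one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")

for alias in ["mistral", "mistral-magicoder", "mistral-conllpp"]:
    # Changing the alias between requests triggers the LoRA hot-swap
    # on the server instead of a full base-model reload.
    completion = client.chat.completions.create(
        model=alias,
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(alias, "->", completion.choices[0].message.content)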
This seems to be a cool feature to have. Any idea when this will be available?
The code is pretty much done and working. I plan to tidy it up a little this weekend, ready for review and (hopefully) merge.
Thanks Rich. Let me know when I can try it out.
This is a PR to add support for loading and changing LoRA adapters at runtime as introduced into llama.cpp in https://github.com/ggerganov/llama.cpp/pull/8332 by @ngxson. Adding this support should allow things like loading a base model, then swapping adapters in and out to support different features and behaviours. This could be really useful in smaller environments where we might use smaller models but want to support a variety of capabilities. (This appears to be the approach taken by some commercial mobile device makers.)
The list of changes from upstream in https://github.com/ggerganov/llama.cpp/pull/8332 is:
This PR is just a draft to show what I'm working on and get some feedback on the API, approach, etc. I do plan on tidying it up, squashing commits, and going through all the different bits of code to check they all work. If there's anything you'd like me to do, please let me know!
For now, I have got something like this working:
Tasks:
- `LlamaLoraAdapter` class, methods in `LlamaContext`
- `Llama` - new `lora_adapters` param and `set_lora_adapter_scaling` method (see the sketch after this list)
- `--lora`, remove `--lora-base`, add `--lora-scaled`
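To make the proposed high-level API concrete, here is a rough sketch of how the new `lora_adapters` param and `set_lora_adapter_scaling` method from the task list might be used. This is only an illustration based on the names above, not the final API; the exact signatures, the scale semantics, and the prompt are assumptions:

from llama_cpp import Llama

# Load the base model with both adapters registered up front; the dict
# maps adapter path to initial scale (0.0 = inactive), mirroring the
# "lora_adapters" server config above. Sketch only - not the final API.
llm = Llama(
    model_path="./mistral-7b-v0.1.Q4_K_S.gguf",
    lora_adapters={
        "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0,
        "./conllpp-lora-mistral-7b-v0.1.gguf": 0.0,
    },
)

# Hot-swap by rescaling: turn one adapter off and the other on,
# without reloading the 7B base model.
llm.set_lora_adapter_scaling("./magicoder-lora-mistral-7b-v0.1.gguf", 0.0)
llm.set_lora_adapter_scaling("./conllpp-lora-mistral-7b-v0.1.gguf", 1.0)

output = llm("Tag the named entities in: 'Rich lives in New Zealand.'", max_tokens=64)
print(output["choices"][0]["text"])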