abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Support LoRA hotswapping and multiple LoRAs at a time #1817

Open richdougherty opened 3 weeks ago

richdougherty commented 3 weeks ago

This is a PR to add support for loading and changing LoRA adapters at runtime as introduced into llama.cpp in https://github.com/ggerganov/llama.cpp/pull/8332 by @ngxson. Adding this support should allow things like loading a base model, then swapping adapters in and out to support different features and behaviours. This could be really useful in smaller environments where we might use smaller models but want to support a variety of capabilities. (This appears to be the approach taken by some commercial mobile device makers.)

The changes from upstream in https://github.com/ggerganov/llama.cpp/pull/8332 are (see the sketch after this list for how they could map onto the Python bindings):

  • Refactor lora API
  • Allow hot-swapping lora
  • Added struct llama_lora_adapter to keep track of loaded lora
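To make the mapping concrete, here is a rough sketch of how a high-level helper such as set_lora_adapter_scale could sit on top of those upstream functions. This is only an illustration, not the implementation in this PR, and it assumes the low-level bindings expose llama_lora_adapter_init / llama_lora_adapter_set / llama_lora_adapter_remove one-to-one with llama.cpp:

import llama_cpp

class LoraAdapterManager:
    """Illustrative only: lazily load adapters and attach them to a context."""

    def __init__(self, model_ptr, ctx_ptr):
        # model_ptr / ctx_ptr stand in for the raw llama_model / llama_context
        # pointers owned by the high-level Llama object.
        self.model_ptr = model_ptr
        self.ctx_ptr = ctx_ptr
        self._adapters = {}  # lora_path -> llama_lora_adapter pointer

    def set_scale(self, lora_path: str, scale: float) -> None:
        adapter = self._adapters.get(lora_path)
        if adapter is None:
            # Each adapter is loaded once against the model and then
            # re-attached to the context as needed.
            adapter = llama_cpp.llama_lora_adapter_init(
                self.model_ptr, lora_path.encode("utf-8")
            )
            self._adapters[lora_path] = adapter
        if scale == 0.0:
            # Treat scale 0 as "detached" so a disabled adapter costs nothing.
            llama_cpp.llama_lora_adapter_remove(self.ctx_ptr, adapter)
        else:
            llama_cpp.llama_lora_adapter_set(self.ctx_ptr, adapter, scale)

In the PR itself the equivalent logic is exposed as a method on the Llama class, as shown in the example further down.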

This PR is just a draft to show what I'm working on and get some feedback on the API, approach, etc. I do plan on tidying it up, squashing commits, and going through all the different bits of code to check they all work. If there's anything you'd like me to do, please let me know!

For now I have something like this working:

import llama_cpp

# Basing off some of the models tested here:
# https://github.com/predibase/lora_bakeoff
model_file_path = '.../mistral-7b-v0.1.Q4_K_S.gguf'
adapter_file_paths = [
    '.../magicoder-lora-mistral-7b-v0.1.gguf',
    '.../conllpp-lora-mistral-7b-v0.1.gguf',
]

# Load the base model with both adapters registered but disabled (scale 0.0).
llm = llama_cpp.Llama(
    model_path=model_file_path,
    lora_adapters={path: 0.0 for path in adapter_file_paths},
)

for adapter_file_path in adapter_file_paths:
    # Clear all adapters by setting their scales to 0
    for lora_path in adapter_file_paths:
        llm.set_lora_adapter_scale(lora_path, 0.0)
    # Enable only the current adapter
    llm.set_lora_adapter_scale(adapter_file_path, 1.0)

    # `task` holds the prompt/kwargs for the current evaluation task (not shown)
    completion = llm.create_completion(
        seed=42,
        temperature=0,
        **task
    )
    print(completion['choices'][0]['text'])

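Since multiple adapters can stay attached at the same time, the same call can also blend adapters instead of swapping a single one in. For example (illustrative scales only, continuing from the snippet above):

# Blend both adapters at partial strength on top of the same base model.
llm.set_lora_adapter_scale(adapter_file_paths[0], 0.5)
llm.set_lora_adapter_scale(adapter_file_paths[1], 0.5)

completion = llm.create_completion(prompt="...", seed=42, temperature=0)
print(completion['choices'][0]['text'])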

richdougherty commented 2 weeks ago

Still working on this. Just added support to the OpenAI-compatible server for hot-swapping LoRAs via model aliases. This allows fast serving of different LoRA adapters that extend the same base model with minimal switching overhead.

{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
          "model_alias": "mistral",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "verbose": true
        },
        {
          "model_alias": "mistral-magicoder",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "lora_adapters": {
            "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0
          },
          "verbose": true
        },
        {
          "model_alias": "mistral-conllpp",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "lora_adapters": {
            "./conllpp-lora-mistral-7b-v0.1.gguf": 1.0
          },
          "verbose": true
        }
    ]
}

Then calling the OpenAI-compatible API with "model": "mistral", "model": "mistral-magicoder", or "model": "mistral-conllpp" will result in a hot-swap, e.g.

Hot-swapping model, setting existing LoRA adapter scales to 0.0.
Hot-swapping model, setting LoRA adapter scales for mistral-conllpp.
llama_lora_adapter_init_internal: loading lora adapter from './conllpp-lora-mistral-7b-v0.1.gguf' ...
llama_lora_adapter_init_internal: CPU_Mapped LoRA buffer size =    13.00 MiB
llama_lora_adapter_init_internal: loaded 128 tensors from lora file
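For illustration, selecting one of the aliases from a client could look like the following. This is a sketch using the openai Python package pointed at the server above; the model names come from the model_alias fields in the config, and the API key can be any placeholder unless the server is configured to require one:

from openai import OpenAI

# Point the client at the llama-cpp-python server started with the config above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# Requesting a different alias backed by the same base model triggers the
# hot-swap shown in the log above rather than a full model reload.
completion = client.completions.create(
    model="mistral-conllpp",
    prompt="Extract the named entities: ...",
    temperature=0,
)
print(completion.choices[0].text)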

hrsmanian commented 4 days ago

This seems to be a cool feature to have. Any idea when this will be available?

richdougherty commented 3 days ago

The code is pretty much done and working. I plan to tidy it up a little this weekend, ready for review and (hopefully) merge.

hrsmanian commented 11 hours ago

Thanks Rich. Let me know when I can try it out.