abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Support LoRA hotswapping and multiple LoRAs at a time #1817

Open richdougherty opened 3 weeks ago

richdougherty commented 3 weeks ago

This is a PR to add support for loading and changing LoRA adapters at runtime as introduced into llama.cpp in https://github.com/ggerganov/llama.cpp/pull/8332 by @ngxson. Adding this support should allow things like loading a base model, then swapping adapters in and out to support different features and behaviours. This could be really useful in smaller environments where we might use smaller models but want to support a variety of capabilities. (This appears to be the approach taken by some commercial mobile device makers.)

The changes from upstream in https://github.com/ggerganov/llama.cpp/pull/8332 are (see the sketch after this list for how they could map onto the Python bindings):

  • Refactor lora API
  • Allow hot-swapping lora
  • Added struct llama_lora_adapter to keep track of loaded lora
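To make the mapping concrete, here is a rough sketch of how a high-level helper such as set_lora_adapter_scale could sit on top of those upstream functions. This is only an illustration, not the implementation in this PR, and it assumes the low-level bindings expose llama_lora_adapter_init / llama_lora_adapter_set / llama_lora_adapter_remove one-to-one with llama.cpp:

import llama_cpp

class LoraAdapterManager:
    """Illustrative only: lazily load adapters and attach them to a context."""

    def __init__(self, model_ptr, ctx_ptr):
        # model_ptr / ctx_ptr stand in for the raw llama_model / llama_context
        # pointers owned by the high-level Llama object.
        self.model_ptr = model_ptr
        self.ctx_ptr = ctx_ptr
        self._adapters = {}  # lora_path -> llama_lora_adapter pointer

    def set_scale(self, lora_path: str, scale: float) -> None:
        adapter = self._adapters.get(lora_path)
        if adapter is None:
            # Each adapter is loaded once against the model and then
            # re-attached to the context as needed.
            adapter = llama_cpp.llama_lora_adapter_init(
                self.model_ptr, lora_path.encode("utf-8")
            )
            self._adapters[lora_path] = adapter
        if scale == 0.0:
            # Treat scale 0 as "detached" so a disabled adapter costs nothing.
            llama_cpp.llama_lora_adapter_remove(self.ctx_ptr, adapter)
        else:
            llama_cpp.llama_lora_adapter_set(self.ctx_ptr, adapter, scale)

In the PR itself the equivalent logic is exposed as a method on the Llama class, as shown in the example further down.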

This PR is just a draft to show what I'm working on and get some feedback on the API, approach, etc. I do plan on tidying it up, squashing commits, and going through all the different bits of code to check they all work. If there's anything you'd like me to do, please let me know!

For now I have something like this working:

import llama_cpp

# Basing off some of the models tested here:
# https://github.com/predibase/lora_bakeoff
model_file_path = '.../mistral-7b-v0.1.Q4_K_S.gguf'
adapter_file_paths = [
    '.../magicoder-lora-mistral-7b-v0.1.gguf',
    '.../conllpp-lora-mistral-7b-v0.1.gguf',
]

# Load the base model with both adapters registered but disabled (scale 0.0).
llm = llama_cpp.Llama(
    model_path=model_file_path,
    lora_adapters={path: 0.0 for path in adapter_file_paths},
)

for adapter_file_path in adapter_file_paths:
    # Clear all adapters by setting their scales to 0
    for lora_path in adapter_file_paths:
        llm.set_lora_adapter_scale(lora_path, 0.0)
    # Enable only the current adapter
    llm.set_lora_adapter_scale(adapter_file_path, 1.0)

    # `task` holds the prompt/kwargs for the current evaluation task (not shown)
    completion = llm.create_completion(
        seed=42,
        temperature=0,
        **task
    )
    print(completion['choices'][0]['text'])

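Since multiple adapters can stay attached at the same time, the same call can also blend adapters instead of swapping a single one in. For example (illustrative scales only, continuing from the snippet above):

# Blend both adapters at partial strength on top of the same base model.
llm.set_lora_adapter_scale(adapter_file_paths[0], 0.5)
llm.set_lora_adapter_scale(adapter_file_paths[1], 0.5)

completion = llm.create_completion(prompt="...", seed=42, temperature=0)
print(completion['choices'][0]['text'])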

richdougherty commented 2 weeks ago

Still working on this. Just added support to the OpenAI-compatible server for hot-swapping LoRAs via model aliases. This allows fast serving of different LoRA adapters that extend the same base model with minimal switching overhead.

{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
          "model_alias": "mistral",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "verbose": true
        },
        {
          "model_alias": "mistral-magicoder",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "lora_adapters": {
            "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0
          },
          "verbose": true
        },
        {
          "model_alias": "mistral-conllpp",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "lora_adapters": {
            "./conllpp-lora-mistral-7b-v0.1.gguf": 1.0
          },
          "verbose": true
        }
    ]
}

Then calling the OpenAI-compatible API with "model": "mistral", "model": "mistral-magicoder", or "model": "mistral-conllpp" will result in a hot-swap, e.g.

Hot-swapping model, setting existing LoRA adapter scales to 0.0.
Hot-swapping model, setting LoRA adapter scales for mistral-conllpp.
llama_lora_adapter_init_internal: loading lora adapter from './conllpp-lora-mistral-7b-v0.1.gguf' ...
llama_lora_adapter_init_internal: CPU_Mapped LoRA buffer size =    13.00 MiB
llama_lora_adapter_init_internal: loaded 128 tensors from lora file
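For illustration, selecting one of the aliases from a client could look like the following. This is a sketch using the openai Python package pointed at the server above; the model names come from the model_alias fields in the config, and the API key can be any placeholder unless the server is configured to require one:

from openai import OpenAI

# Point the client at the llama-cpp-python server started with the config above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# Requesting a different alias backed by the same base model triggers the
# hot-swap shown in the log above rather than a full model reload.
completion = client.completions.create(
    model="mistral-conllpp",
    prompt="Extract the named entities: ...",
    temperature=0,
)
print(completion.choices[0].text)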

hrsmanian commented 4 days ago

This seems to be a cool feature to have. Any idea when this will be available?

richdougherty commented 3 days ago

The code is pretty much done and working. I plan to tidy it up a little this weekend, ready for review and (hopefully) merge.

hrsmanian commented 11 hours ago

Thanks Rich. Let me know when I can try it out.