ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Difficulties Using LLaMa.cpp Server and --prompt-cache [FNAME] (not supported?) #9135

Closed darien-schettler closed 1 week ago

darien-schettler commented 3 weeks ago

What happened?

As seen here:

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

The llama.cpp server should support --prompt-cache [FNAME]

I have not been able to get this feature to work. I have tried workarounds such as using llama-cli to generate the prompt cache and then specifying this file for llama-server.

Is there some minimally reproducible code snippet that shows this feature working? Is it implemented?

Thanks in advance.

Name and Version

CLI call to generate the prompt cache:

version: 3613 (fc54ef0d) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

$ ./llama-cli -m "/.../Meta-Llama-3.1-8B-Instruct-Q6_K.gguf" -c 4096 --verbose-prompt -co --mlock -t $(nproc) --prompt-cache "/.../prompt_cache/prompt_cache.bin" --prompt-cache-all --file "/.../prompt_files/pirate_prompt.txt"

Server call (after generating prompt_cache.bin with llama-cli). The prompt file here is the same as the one above, minus the final user input, which will be sent via the request.

version: 3613 (fc54ef0d) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

$ ./llama-server -m "/.../Meta-Llama-3.1-8B-Instruct-Q6_K.gguf" --host 0.0.0.0 --port 8080 -c 4096 --verbose-prompt -co --mlock -t $(nproc) --prompt-cache "/.../prompt_cache/prompt_cache.bin" --prompt-cache-ro --keep -1 -f "/.../prompt_files/pirate_prompt_server.txt"

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

ggerganov commented 3 weeks ago

The argument is ignored by llama-server. It would be nice to implement, but it's not very clear how, since it has to consider multiple parallel slots. Or, at the very least, assert that -np 1 is used.

ngxson commented 3 weeks ago

The --prompt-cache option is not directly supported by the server. You can use prompt caching with the /slots endpoint. It allows you to save and load the KV cache for each slot.

See usage in the docs: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-slotsid_slotactionsave-save-the-prompt-cache-of-the-specified-slot-to-a-file

darien-schettler commented 3 weeks ago

Ah, I see. So the documentation I linked to is just out of date (or it is correct in saying the argument is supported, but the functionality is not implemented).

I can update that if you’d like. Thank you for clarifying.

As a follow-up: on a powerful CPU-only machine, if I have a pipeline of 3-4 steps (each requiring different system prompts and few-shot examples (message history)), is there a way to cache all of those ahead of time?

This pipeline is run E2E as a single call by many users, but it's not a continuous conversation.

I don't want to pay the prompt-processing latency for each step (it takes an inordinate amount of time even on a powerful machine). Ideally I'd have one model with the ability to determine dynamically which cached prompt is the best match and go from there.

If that’s not possible, do you have any suggestions for workarounds that minimize latency?

darien-schettler commented 3 weeks ago

The --prompt-cache option is not directly supported by the server. You can use prompt caching with the /slots endpoint. It allows you to save and load the KV cache for each slot.

See usage in the docs: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-slotsid_slotactionsave-save-the-prompt-cache-of-the-specified-slot-to-a-file

I will experiment with this today and see if I can make it work. I will report back if you'd like.

ngxson commented 3 weeks ago

In this case, you don't even need to write the prompt cache to disk. You can use the cache_prompt option:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. You task is......."},
    {"role": "user", "content": "You are you"}
  ],
  "cache_prompt": true
}

The first time, all tokens will be processed and kept in the cache (so it takes time).

The second time, it will reuse the cached tokens:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. You task is......."},
    {"role": "user", "content": "This is another question"}
  ],
  "cache_prompt": true
}

This time, only "This is another question" will be processed. The system prompt is cached.
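
For example, with curl against the server's OpenAI-compatible chat endpoint (host and port are placeholders; I'm assuming the default /v1/chat/completions route here):

curl --request POST \
  --url 'http://[IP]:[PORT]/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant. Your task is......."},
      {"role": "user", "content": "This is another question"}
    ],
    "cache_prompt": true
  }'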

darien-schettler commented 3 weeks ago

In this case, you don't even need to write the prompt cache to disk. You can use the cache_prompt option:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. You task is......."},
    {"role": "user", "content": "You are you"}
  ],
  "cache_prompt": true
}

The first time, all tokens will be processed and kept in the cache (so it takes time).

The second time, it will reuse the cached tokens:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. You task is......."},
    {"role": "user", "content": "This is another question"}
  ],
  "cache_prompt": true
}

This time, only "This is another question" will be processed. The system prompt is cached.

I will experiment with this too. Thanks in advance.

If any of the solutions above fixes things I will close this issue with a comment detailing my experience.

Devbrat-Dev commented 1 week ago

The --prompt-cache option is not directly supported by the server. You can use prompt caching with the /slots endpoint. It allows you to save and load the KV cache for each slot.

See usage in the docs: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-slotsid_slotactionsave-save-the-prompt-cache-of-the-specified-slot-to-a-file

This works for me, as my system only has a CPU, and processing longer prompts takes too much time.

I start the llama-server with the --slot-save-path PATH option to specify the path for saving the slot KV cache.

Before terminating the llama-server, I save the KV cache by sending an API request using curl:

curl --request POST \
  --url 'http://[IP]:[PORT]/slots/0?action=save' \
  --header 'Content-Type: application/json' \
  --data '{
    "filename": "cache.bin"
}'

To restore the KV cache when starting llama-server next time, send an API request using curl:

curl --request POST \
  --url 'http://[IP]:[PORT]/slots/0?action=restore' \
  --header 'Content-Type: application/json' \
  --data '{
    "filename": "cache.bin"
}'

darien-schettler commented 1 week ago

Hi all, I've been able to get the initial setup working by simply relying on cache_prompt=True and triggering all 20 known prompts (for the various tools/endpoints).
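
Roughly, the warm-up looks like the sketch below (host, port, and paths are placeholders; I'm assuming each of the known prompts is stored as a complete request body, with cache_prompt enabled, in its own JSON file):

# trigger each known prompt once with cache_prompt enabled in its request body
for f in ./prompt_files/*.json; do
  curl --request POST \
    --url 'http://[IP]:[PORT]/v1/chat/completions' \
    --header 'Content-Type: application/json' \
    --data @"$f"
done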

@Devbrat-Dev - I will try the method you mentioned and report back. After that I will close this issue.

Thanks for all the support!

darien-schettler commented 1 week ago

EDIT: It appears I needed to create the cache file myself first. I found the error in the logging output (I missed it initially because the verbosity flag I passed overwhelmed the logs). To fix it, I just ran touch [PATH INCLUSIVE OF FILENAME cache.bin] before starting the server. It works now! Thanks! Closing this issue.
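
Concretely, the sequence was roughly the following (paths are placeholders; as far as I can tell, the filename used with the /slots endpoint is resolved relative to the --slot-save-path directory):

# create the (empty) cache file inside the slot-save directory first
touch [SLOT_SAVE_DIR]/cache.bin

# then start the server with --slot-save-path pointing at that directory
./llama-server -m [FILEPATH] --host [HOST] --port [PORT] -c 8192 --mlock -t $(nproc) --slot-save-path [SLOT_SAVE_DIR]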


@Devbrat-Dev - I tried this but unfortunately I'm getting n_written: 0 ... and the file is not being created.

server launch:

 ./llama-server -m [FILEPATH] --verbose --host [HOST] --port [PORT] -c 8192 --mlock -t $(nproc) --slot-save-path [path/like/this/to/dir/]

I then send a chat message (via the OpenAI-compatible endpoint) to populate the slot, and then attempt the save.

request:

curl --request POST \
  --url 'http://[IP]:[PORT]/slots/0?action=save' \
  --header 'Content-Type: application/json' \
  --data '{
    "filename": "cache.bin"
}'

response:

{'id_slot': 0,
 'filename': 'cache.bin',
 'n_saved': 243,
 'n_written': 0,
 'timings': {'save_ms': 0.068}}

The n_saved value updates... but unfortunately the file isn't created or added to. My prompt requests go through the OpenAI-compatible server endpoint, and I'm passing cache_prompt: true as an extra_body parameter.

darien-schettler commented 1 week ago

Thanks for the support. The tl;dr is that the --prompt-cache argument is currently ignored by llama-server.

That said, you can work around this by using cache_prompt in requests to the server, paired with --slot-save-path when starting the server, and then using POST requests to the /slots endpoint to save, restore, or delete a slot's KV cache (condensed sketch below).

With only 1 slot this should be equivalent to what I was trying to achieve initially.
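
For anyone who finds this later, a condensed sketch of that workaround (hosts, ports, and paths are placeholders, and a single slot is assumed):

# 1. start the server with a directory for slot snapshots
./llama-server -m [MODEL.gguf] --host [HOST] --port [PORT] -c 8192 --slot-save-path [SLOT_SAVE_DIR]

# 2. send normal requests with "cache_prompt": true in the body, so the shared
#    prefix stays in the slot's KV cache between requests

# 3. persist or restore the slot's KV cache via the /slots endpoint
curl -X POST 'http://[IP]:[PORT]/slots/0?action=save' -H 'Content-Type: application/json' -d '{"filename": "cache.bin"}'
curl -X POST 'http://[IP]:[PORT]/slots/0?action=restore' -H 'Content-Type: application/json' -d '{"filename": "cache.bin"}'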

Thanks for the support everyone.