abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Add batched inference #771

Open abetlen opened 9 months ago

abetlen commented 9 months ago
JackKCWong commented 9 months ago

Silly question, does that also support parallel decoding in llama.cpp?

steveoOn commented 9 months ago

Does the newest version support llama.cpp's "batched decoding"?

https://github.com/ggerganov/llama.cpp/pull/3228

@abetlen

LoopControl commented 8 months ago

This would be a huge improvement for production use.

I tested locally with 4 parallel requests to the built-in ./server binary in llama.cpp and am able to hit some insanely good tokens/sec -- multiple times faster than what we get with a single request via the non-batched inference.

hockeybro12 commented 8 months ago

This would be a huge improvement for production use.

I tested locally with 4 parallel requests to the built-in ./server binary in llama.cpp and am able to hit some insanely good tokens/sec -- multiple times faster than what we get with a single request via the non-batched inference.

@LoopControl How did you do 4 parallel requests to the ./server binary? Can you please provide an example, I'm trying to do the same. Thanks!

LoopControl commented 8 months ago

@LoopControl How did you do 4 parallel requests to the ./server binary? Can you please provide an example, I'm trying to do the same. Thanks!

There are two new flags to add to your normal llama.cpp server command: -cb -np 4 (-cb enables continuous batching, -np sets the parallel request count).
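
For illustration (not part of the original comment), here is a client-side sketch of firing four concurrent requests at a ./server started with those flags; the port, prompts, and request fields below are assumptions:

```python
# Sketch: send N concurrent requests to a llama.cpp ./server started with
# continuous batching enabled, e.g.:
#   ./server -m model.gguf -c 4096 -cb -np 4
# (model path, context size, port, and prompts are placeholders)
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVER_URL = "http://127.0.0.1:8080/completion"  # default llama.cpp server port

def complete(prompt: str) -> str:
    payload = json.dumps({"prompt": prompt, "n_predict": 64}).encode("utf-8")
    req = urllib.request.Request(
        SERVER_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

prompts = [f"Write a one-line haiku about request number {i}." for i in range(4)]

# With -np 4 the server can decode these four requests in the same batch
# instead of queueing them one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(complete, prompts):
        print(result)
```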

hockeybro12 commented 8 months ago

@LoopControl

@LoopControl How did you do 4 parallel requests to the ./server binary? Can you please provide an example, I'm trying to do the same. Thanks!

There are two new flags to add to your normal llama.cpp server command: -cb -np 4 (-cb enables continuous batching, -np sets the parallel request count).

Thanks, that works for me with llama.cpp, but not llama-cpp-python, which I think is expected. Unfortunately, the server API in llama.cpp here doesn't seem to be as good as the server in llama-cpp-python, at least for my task. Using the same llama model, I get better results with llama-cpp-python. So, I hope this can be added soon!

zpzheng commented 8 months ago

When will this feature be available? I hope someone can help solve this problem.

ggerganov commented 8 months ago

Let me know if there are any roadblocks - I might be able to provide some insight

abetlen commented 8 months ago

Hey @ggerganov I missed this earlier.

Thank you, yeah I just need some quick clarifications around the kv cache behaviour.

The following is my understanding of the kv_cache implementation

Is this correct?

ggerganov commented 8 months ago

Yes, all of this is correct.

Calling llama_kv_cache_seq_shift works by modifying the KV cells that belong to a given sequence; however, this also shifts those cells in all of the other sequences they belong to.

This call also sets a flag so that, upon the next llama_decode, the computation will first shift the KV cache data before proceeding as usual.

Will soon add a couple of functions to the API that can be useful for monitoring the KV cache state:

https://github.com/ggerganov/llama.cpp/pull/4170

One of the main applications of llama_kv_cache_seq_cp is to "share" a common prompt (i.e. the same tokens at the same positions) across multiple sequences. The most trivial example is a system prompt which is at the start of all generated sequences. By sharing it, the KV cache will be reused and thus less memory will be consumed, instead of having a copy for each sequence.
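
For illustration, a minimal sketch of that prompt-sharing pattern using the low-level llama_cpp bindings; it assumes ctx is a live llama_context into which the shared prompt has already been decoded as sequence 0, and that the binding mirrors the C signature llama_kv_cache_seq_cp(ctx, seq_id_src, seq_id_dst, p0, p1):

```python
# Sketch only: share a common system prompt across parallel sequences by
# copying its KV cells from sequence 0 instead of re-decoding it per sequence.
# Assumes `ctx` (a llama_context) and `n_prompt_tokens` come from elsewhere.
import llama_cpp

def share_prompt_across_sequences(ctx, n_prompt_tokens: int, n_parallel: int) -> None:
    # The prompt occupies positions [0, n_prompt_tokens) of sequence 0.
    # After this loop, sequences 1..n_parallel-1 reference the same cells,
    # so the prompt's KV data is stored once rather than n_parallel times.
    for seq_id in range(1, n_parallel):
        llama_cpp.llama_kv_cache_seq_cp(ctx, 0, seq_id, 0, n_prompt_tokens)
```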

zpzheng commented 7 months ago

I updated the version and saw the batch configuration. But when I ran it, batching didn't take effect. When I send multiple requests, it still handles them one by one. My startup configuration is as follows:

python3 -m llama_cpp.server --model ./models/WizardLM-13B-V1.2/ggml-model-f16-Q5.gguf --n_gpu_layers 2 --n_ctx 8000 --n_batch 512 --n_threads 10 --n_threads_batch 10 --interrupt_requests False

Is there something wrong with my configuration? @abetlen

LoopControl commented 7 months ago

@zpzheng It’s a draft PR, so it’s not complete - you can see “Add support for parallel requests” is still in the todo list.

Zahgrom34 commented 6 months ago

@abetlen Is there any progress on this?

K-Mistele commented 6 months ago

+1, would be really great to have this

everyfin-in commented 5 months ago

+1, would be so great to have this!

sadaisystems commented 5 months ago

+1

ArtyomZemlyak commented 5 months ago

+1

chenwr727 commented 4 months ago

+1

Connor2573 commented 4 months ago

+1

shoaibmalek21 commented 4 months ago

Guys, any other solution for this?

jasongst commented 4 months ago

+1

ganliqiang commented 3 months ago

+1

stanier commented 3 months ago

+1

I use llama-cpp-python as (presently) the sole first-class-support backend in a project I've been developing. I'm not sure how much it would benefit from batching, as I've yet to do performance testing against other backends, but I feel like it could be a significant boon.

What's the current status of this and #951? I might be interested in taking a look at this, but I'm not certain I'd bring much to the table; I'll have to review the related code more.

K-Mistele commented 3 months ago

I use llama-cpp-python as (presently) the sole first-class-support backend in a project I've been developing.

I would not do this. Batching is super important and I had to move to llama.cpp's server (easy to deploy w/ docker or python, or even just the exe) because of the lack of features in llama-cpp-python. If you're doing CPU inference, llama.cpp is a great option; otherwise I would use something like vLLM, BentoML's OpenLLM, or Predibase's LoRAx.

stanier commented 3 months ago

I would not do this. Batching is super important and I had to move to llama.cpp's server

This is something I was considering; I appreciate the advice. I'll likely end up doing that. I had to do the same with Ollama, but I wasn't on Ollama long and by no means felt it was the right fit for the job; support for it merely started from a peer showing interest and my compulsion to explore all viable options where possible.

I'm doing GPU inference, and sadly that means Nvidia's antics have hindered me from getting things running in a container just the way I'd like them to up until now... but that's another story. I haven't tried vLLM, OpenLLM, or LoRAx; llama.cpp and llama-cpp-python have generally been all I've needed up till now (and for longer, I hope -- I really appreciate the work done by all contributors to both projects; it's exciting that we're at least where we are today). Are those libraries any good if you're looking to get something like q6_k perplexity on a (VRAM) budget? I'd prefer to be able to run it on my 1080 Ti, even when I have access to more VRAM in another environment.

yourbuddyconner commented 1 month ago

I am dealing with this right now -- and unfortunately llama-cpp-server has the only completions endpoint that I can find that supports logprobs properly (random hard requirement I have). llama.cpp doesn't natively support it in its API, which is why I use llama-cpp-python.

vLLM is great for big GPUs; however, it doesn't support GGUF-quantized models, and running at full precision doesn't fit on smaller GPUs (I use T4s).

Ideally I could run llama-cpp-python on an instance with 4 T4s and batch across them (like vLLM can do), but instead I am creating instances with 1 GPU and scaling horizontally.

If anyone knows of a better openai-compatible API endpoint that wraps llama.cpp, I am listening, but I haven't found one.
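
For illustration, a minimal request against llama-cpp-python's OpenAI-compatible /v1/completions endpoint with logprobs enabled might look like this (port, prompt, and parameter values are assumptions, not taken from the comment above; assumes a server started via python3 -m llama_cpp.server on the default port 8000):

```python
# Sketch: request token logprobs from llama-cpp-python's OpenAI-compatible
# completions endpoint. Assumes a server started with, for example:
#   python3 -m llama_cpp.server --model ./model.gguf
# (model path, port, and prompt are placeholders)
import json
import urllib.request

payload = json.dumps({
    "prompt": "The capital of France is",
    "max_tokens": 8,
    "logprobs": 5,          # return top-5 logprobs per generated token
    "temperature": 0.0,
}).encode("utf-8")

req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    choice = json.loads(resp.read())["choices"][0]

print(choice["text"])
print(choice["logprobs"]["token_logprobs"])  # per-token log probabilities
```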

dimaioksha commented 1 month ago

I am dealing with this right now -- and unfortunately llama-cpp-server has the only completions endpoint that I can find that supports logprobs properly (random hard requirement I have). llama.cpp doesn't natively support it in its API, which is why I use llama-cpp-python.

vLLM is great for big GPUs; however, it doesn't support GGUF-quantized models, and running at full precision doesn't fit on smaller GPUs (I use T4s).

Ideally I could run llama-cpp-python on an instance with 4 T4s and batch across them (like vLLM can do), but instead I am creating instances with 1 GPU and scaling horizontally.

If anyone knows of a better openai-compatible API endpoint that wraps llama.cpp, I am listening, but I haven't found one.

Have you tried https://github.com/ollama/ollama?

yourbuddyconner commented 1 month ago

Ollama doesn't support batched inference; what a silly suggestion.

https://github.com/ollama/ollama/issues/1396

NickCrews commented 1 month ago

In case this is useful to others, as a workaround until this is implemented, I wrote a tiny Python library that wraps the upstream ./server binary.

This works because the raw server binary already supports batched inference. All the heavy logic is in the upstream server, so all I needed to do was handle the CLI and subprocess logic.
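
The library itself isn't reproduced here; this is just a rough sketch of the general approach it describes. The binary path, model path, port, flag values, and the /health readiness check are assumptions (older server builds may not expose /health):

```python
# Rough sketch of the "wrap the batching-capable ./server binary" approach.
# Paths, port, and flag values are placeholders, not the actual library.
import json
import subprocess
import time
import urllib.error
import urllib.request

def start_batched_server(server_bin="./server", model="./model.gguf",
                         port=8080, n_parallel=4):
    proc = subprocess.Popen([
        server_bin, "-m", model,
        "-c", "4096",
        "-cb",                      # continuous batching
        "-np", str(n_parallel),     # number of parallel slots
        "--port", str(port),
    ])
    # Poll until the server answers (assumes a /health endpoint; adjust if
    # your build predates it).
    url = f"http://127.0.0.1:{port}/health"
    for _ in range(60):
        try:
            with urllib.request.urlopen(url) as resp:
                if json.loads(resp.read()).get("status") == "ok":
                    return proc
        except (urllib.error.URLError, ConnectionError):
            time.sleep(0.5)
    proc.terminate()
    raise RuntimeError("llama.cpp server did not become ready")
```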

dabs9 commented 2 weeks ago

Does this mean that continuous batching is not supported in llama-cpp-python? I assume this is the type of batching under consideration in this issue.

KohakuBlueleaf commented 1 day ago

Does this mean that continuous batching is not supported in llama-cpp-python? I assume this is the type of batching under consideration in this issue.

Yes and no. Yes: continuous batching is not "utilized" in llama-cpp-python. No: you can't even do the simplest form of batching, which is encoding multiple prompts at the same time and decoding multiple sequences at the same time. Continuous batching is something "beyond" this.
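
To make that concrete, with today's high-level API the best you can do is run prompts one after another, roughly like this sketch (model path and prompts are placeholders):

```python
# Illustration of the limitation described above: the high-level Llama object
# decodes one prompt at a time, so "batching" degenerates to a sequential loop.
# Model path and prompts are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf", n_ctx=2048)

prompts = [
    "Summarize the theory of relativity in one sentence.",
    "Write a haiku about continuous batching.",
]

# Each call runs a full, independent decode; nothing is batched across prompts.
outputs = [llm(p, max_tokens=64)["choices"][0]["text"] for p in prompts]
for out in outputs:
    print(out)
```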