ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Investigate PagedAttention KV-cache memory management for faster inference #1955

Closed Azeirah closed 7 months ago

Azeirah commented 1 year ago

New research just came out on using a technique inspired by kernel virtual memory and pages to manage the KV cache.

Results? Way faster inference!

https://vllm.ai/

They claim up to 24x the throughput (measured in requests handled per second) compared to Hugging Face's Transformers library.


How?

Inference is bottlenecked by memory, most notably the KV cache. They say the KV cache's most notable features are that it is large and that it grows and shrinks dynamically over a request's lifetime, and that existing systems waste 60-80% of this memory through fragmentation and over-reservation.

PagedAttention is an alternative approach to managing the KV cache, inspired by virtual memory with its pages and blocks. By allocating the space dynamically with this approach, only about 4% of memory is wasted at most, instead of the aforementioned 60-80%.
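As a rough mental model of why dynamic, block-based allocation bounds the waste: each sequence only ever holds whole blocks plus one partially filled last block, so at most one block per sequence sits idle. A minimal sketch, assuming hypothetical `PagedKVCache`/`Sequence` types and an arbitrary block size (this is not vLLM's actual implementation):

```cpp
// Minimal sketch of block-based KV allocation; PagedKVCache, Sequence, and
// the block size are illustrative assumptions, not vLLM's or llama.cpp's code.
#include <cstdint>
#include <vector>

struct PagedKVCache {
    static constexpr int kBlockTokens = 16;   // tokens per physical block (illustrative)
    std::vector<int32_t> free_blocks;         // indices of unused physical blocks

    explicit PagedKVCache(int32_t n_blocks) {
        for (int32_t i = n_blocks - 1; i >= 0; --i) free_blocks.push_back(i);
    }

    // Per-sequence "block table": logical block -> physical block.
    struct Sequence {
        std::vector<int32_t> block_table;
        int n_tokens = 0;
    };

    // Reserve room for one more token's K/V; a new block is taken from the
    // free list only when the sequence's last block is full.
    bool append_token(Sequence & seq) {
        if (seq.n_tokens % kBlockTokens == 0) {
            if (free_blocks.empty()) return false;   // out of cache space
            seq.block_table.push_back(free_blocks.back());
            free_blocks.pop_back();
        }
        seq.n_tokens++;
        return true;
    }
};
```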

For further details, refer to their website and GitHub.

JohannesGaessler commented 1 year ago

llama.cpp currently only ever serves one user at a time so this optimization is not applicable.

nivibilla commented 1 year ago

I assume it would be useful if we want to host the models and have an interface like chat.openai.com?

JohannesGaessler commented 1 year ago

Yes, for enterprise use, where you have one server generating responses for many users in parallel, the optimization would be useful.

Azeirah commented 1 year ago

llama.cpp currently only ever serves one user at a time so this optimization is not applicable.

Oh I wasn't aware this was exclusively for a client-server application, that explains why they measure performance in requests/sec 🥲

howard0su commented 1 year ago

This optimization is still applicable, as it can reduce the VRAM usage of the KV tensors.

nivibilla commented 1 year ago

If we do end up building this for server use (and I think that would be a good idea), then this paging system would be very useful.

howard0su commented 1 year ago

I read through the blog and the code. It turns out PagedAttention is a way to manage memory so that the compute kernel doesn't require the KV cache to be contiguous. This makes it possible to have one prompt's KV block extended by multiple outputs' KV blocks, like the following:

Prompt KV Block ------ Output 1 KV Block
                ------ Output 2 KV Block
                ------ ...

This is super helpful if your prompt is long and you need to generate multiple results. It is a purely engineering trick: the change is mainly in how we manage the KV cache in VRAM. If we are using the CPU, this is even simpler to implement (as simple as a list vs. a vector).
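To make the diagram concrete, here is a hedged C++ sketch of that sharing (`Block`, `Sequence`, and `fork_from_prompt` are hypothetical names, not llama.cpp or vLLM code): forking several outputs from one long prompt copies only the block table and bumps reference counts, while the prompt's K/V data itself is written once.

```cpp
// Rough sketch of sharing a prompt's KV blocks across several outputs via
// reference counts; all types and names here are made up for illustration.
// Block ids are assumed to double as indices into `pool`.
#include <cstdint>
#include <vector>

struct Block {
    int32_t ref_count = 0;   // how many sequences map this physical block
};

struct Sequence {
    std::vector<int32_t> block_table;   // prompt blocks first, then the output's own blocks
};

// Fork n_outputs sequences from one shared prompt: no K/V data is copied,
// only the block table, and the shared blocks' reference counts are bumped.
std::vector<Sequence> fork_from_prompt(std::vector<Block> & pool,
                                       const Sequence & prompt, int n_outputs) {
    std::vector<Sequence> outputs(n_outputs);
    for (auto & out : outputs) {
        out.block_table = prompt.block_table;        // share the prompt's blocks
        for (int32_t b : prompt.block_table) pool[b].ref_count++;
    }
    return outputs;
}
```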

slaren commented 1 year ago

We allocate all the KV memory required for the maximum context length on startup in one block, so we shouldn't have any fragmentation either.
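For a sense of scale, a back-of-the-envelope calculation of that one-shot reservation (the LLaMA-7B-like numbers below are illustrative assumptions, not values read from the llama.cpp source): the full max-context region, roughly 1 GiB here, is held whether or not the context ever fills up.

```cpp
// Back-of-the-envelope for an up-front KV reservation; parameters are
// illustrative, not taken from llama.cpp.
#include <cstdio>

int main() {
    const long long n_layer = 32, n_ctx = 2048, n_embd = 4096, elt_size = 2; // f16
    const long long kv_bytes = 2 /* K and V */ * n_layer * n_ctx * n_embd * elt_size;
    std::printf("KV cache reserved at startup: %lld bytes (~%.2f GiB)\n",
                kv_bytes, kv_bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```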

randxie commented 1 year ago

@JohannesGaessler Is serving multiple users concurrently or batch inference on the roadmap of llama.cpp?

JohannesGaessler commented 1 year ago

I don't have any plans for it because I don't care about commercial use but I can't speak for the other devs.

okpatil4u commented 1 year ago

Should it not be on the list?

Today we are talking about chatbots; in 6 months or so, people will start looking for autonomous agents.

Would it not make sense to build a system that can process multiple requests simultaneously and efficiently?


nivibilla commented 1 year ago

Yeah, I think first we need to solve batch inference. It's implemented in babyllama, but I haven't tried to port it over to the main llama yet.

JohannesGaessler commented 1 year ago

I'm not really concerned with what other people want to use llama.cpp for. I'm implementing things that are useful for me personally first and foremost. And I don't see how I would benefit from batched inference since I only run llama.cpp for myself on my own hardware.

nivibilla commented 1 year ago

That's fair. Batch inference would be useful for me to use this at scale, for example if I want to do sentiment analysis or summarisation over a large dataset.

nivibilla commented 1 year ago

And in this case, having a server to handle multiple users at the same time would be useful.

vikigenius commented 1 year ago

I have a comparison of the PyTorch implementations with and without paging on a single GPU, and the gains are significant. My use case is primarily batch inference, so I am not sure about model serving.

With a 40 GB A100 GPU:

Inference on a vicuna-13B model without paged attention produces 20 tokens/sec.
Inference on a vicuna-13B model with paged attention produces 190 tokens/sec.

So the speedup is almost 10x. Obviously this is a bit skewed, because our workload uses the same initial prompt prefix in a batch-inference setting, so there may be good reuse of the KV cache, which PagedAttention helps with.

okpatil4u commented 1 year ago

Thanks Vikash. You mentioned in another thread that there may be some misalignment in this thread in terms of understanding how vLLM works. Could you please explain what you meant by that?

Also, there have been other comments about its effect on performance on CPU, GPU, and Mac M1/M2 GPUs. Could you or someone else shed some light on that?

keeganmccallum commented 1 year ago

From what I understand, this isn't so much related to the multi-user/client-server use case as it is to batched inference, which does seem to be a valid use case even for single-user/local apps, depending on the workload.

chrfalch commented 1 year ago

Wouldn’t the decreased memory requirement (they state that they cut 55% memory usage) be positive when running inference on smaller devices like phones and laptops as well?

FNsi commented 1 year ago

Should be useful if there's a large context.

viktor-ferenczi commented 1 year ago

Both vLLM and lmDeploy have high throughput batch-inference modes with various tricks. Problem is they don't support GGUF.

How complex would it be to port those tricks (KV cache paging, dynamic batching) to llama.cpp?
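For context on the second trick: "dynamic batching" (often called continuous batching) is mostly a scheduling change rather than a kernel change. A hedged sketch of that loop, with made-up `Request`/`Scheduler` types and no relation to llama.cpp's actual server implementation:

```cpp
// Sketch of continuous batching: finished sequences free their slot at every
// decode step and queued requests are admitted immediately. Illustrative only.
#include <algorithm>
#include <deque>
#include <vector>

struct Request { int id; int generated = 0; int max_new_tokens = 0; };

struct Scheduler {
    int max_slots;                    // how many sequences decode in parallel
    std::vector<Request> running;
    std::deque<Request>  waiting;

    void step() {
        // Admit queued requests into any free slots.
        while ((int) running.size() < max_slots && !waiting.empty()) {
            running.push_back(waiting.front());
            waiting.pop_front();
        }
        // One decode step for every running sequence (the real work would be a
        // single batched forward pass over all of them).
        for (auto & r : running) r.generated++;
        // Retire finished sequences so their slot (and KV blocks) can be reused.
        running.erase(std::remove_if(running.begin(), running.end(),
                          [](const Request & r) { return r.generated >= r.max_new_tokens; }),
                      running.end());
    }
};
```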

KerfuffleV2 commented 1 year ago

#2813 - still need to implement the non-tricky version.

Related, there's #2969 - also should be a 50% memory use reduction.

kiratp commented 1 year ago

#2813 only covers "same prompt, multiple output", not "multiple prompt, multiple output".

henk717 commented 1 year ago

I would like to voice my support for this. Over at the KoboldAI community we have had requests for multi-user support, and it would also help our Horde platform, which currently benefits from TGI's speed, but TGI gives us poor output compared to llama.cpp.

Having llama.cpp be fast for these use cases means multiple communities would begin using it as a general-purpose inference server, which would be a cool addition to the project (once multiple requests can be queued up).

tikikun commented 1 year ago

I think this feature is important to make llama.cpp usage spread even further.

viktor-ferenczi commented 1 year ago

Which one would be easier? Porting performance/throughput tricks into llama.cpp or porting GGUF support into vLLM?

(lmDeploy is out of the picture, since they don't want to support GGUF. They closed the feature request / suggestion ticket, since they want to concentrate on other things.)

randxie commented 1 year ago

IMO, implementing the same idea inside llama.cpp is much better. Currently, vLLM leverages a PyTorch extension to customize the attention kernel. One benefit of llama.cpp is that it gets rid of PyTorch and is more friendly to edge deployment.

We can consider porting the kernels in vLLM into llama.cpp. It probably requires a certain amount of refactoring in llama.cpp, though.

bobqianic commented 1 year ago

#3479

naik-amey commented 1 year ago

Where is the KVCacheManager implemented? Is it on the GPU or the host (CPU)?

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

64933988 commented 7 months ago

Such a great optimization, yet no one has wanted to integrate it!!!

phymbert commented 7 months ago

Such a great optimization, yet no one has wanted to integrate it!!!

Please discuss in English here. Also, would you please elaborate on which feature, as of today, no one wants to integrate?

K-Mistele commented 6 months ago

Worth re-opening? The server executable can handle multiple users at a time, so it seems like this would be a really valuable thing to add.

YanlinWangWang commented 5 months ago

Worth re-opening? The server executable can handle multiple users at a time, so it seems like this would be a really valuable thing to add.

And it can help reduce GPU memory usage. I think it's time to start working on this.