InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Docs] Where is prefix cache data stored? #1737

Closed: josephrocca closed this 1 month ago

josephrocca commented 1 month ago

I'm guessing the prefix cache is stored in GPU VRAM. I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache? Or would that generally be too slow, i.e. faster to just recompute the data than to wait for it to stream from CPU to GPU?

josephrocca commented 1 month ago

Also, as an aside, the docs say:

--cache-max-entry-count to adjust the GPU mem ratio for k/v cache etc.

but this is a bit confusing and lacks what I think are important details. TGI has --cuda-memory-fraction and vLLM has --gpu-memory-utilization, and the engine itself will determine how much to allocate to k/v cache, based on how much VRAM is available - this is simpler and less confusing. I'm not sure whether this approach is compatible with lmdeploy's paradigm, but I'm mentioning it as a point of comparison.

I eventually found this page: https://lmdeploy.readthedocs.io/en/v0.4.2/api/pipeline.html#turbomindengineconfig but it was a bit hard to find, and I think it should be linked from the above-mentioned point in the docs.

For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache

I think a new parameter should be added with a more appropriate name. I'm also wondering whether it is safe to set this value to 1.0. Or is the default of 0.8 there to prevent CUDA OOM errors? Or is it for testing on desktop (non-server) machines, where some VRAM must be left for the OS/desktop rendering? If so, is it safe to set it to 1.0 on servers?
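
For reference, this is roughly how I set the ratio today through the pipeline API (a sketch based on the TurbomindEngineConfig docs linked above; the model name is just an example):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is a ratio of the *free* GPU memory remaining after the
# weights are loaded, reserved for the k/v cache (default 0.8 for lmdeploy > v0.2.1).
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['Hello, who are you?']))
```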

zhyncs commented 1 month ago

You may refer to the design and implementation of prefix cache as follows:

design https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2044203407

implementation https://github.com/InternLM/lmdeploy/pull/1450

The implementation is consistent with the overall design, and subsequent modifications were made based on the review comments in the PR.

I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache?

The prefix cache is a reuse of the existing KV cache blocks, so there is no need to do this.
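
In pseudocode, the block-reuse idea is roughly the following (a simplified sketch with made-up names; see the design issue and PR above for the real data structures):

```python
# Illustrative only: prefix caching as reuse of already-computed GPU KV cache blocks.
cached_blocks = {}  # hash(token ids filling one block) -> GPU KV block id

def match_prefix(token_ids, block_size=64):
    """Return the ids of cached KV blocks covering the longest shared prefix."""
    reused = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        key = hash(tuple(token_ids[start:start + block_size]))
        if key not in cached_blocks:
            break  # the rest of the prompt must be prefilled normally
        reused.append(cached_blocks[key])
    return reused
```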

zhyncs commented 1 month ago

the engine itself will determine how much to allocate to k/v cache, based on how much VRAM is available

Yep. vLLM uses profile_run to check whether the gpu-memory-utilization setting is OK.

https://github.com/vllm-project/vllm/blob/c96fc067479453b02e92d9378eeeaebb6b3816de/vllm/worker/worker.py#L135-L154

I have mentioned using a similar approach before (https://github.com/InternLM/lmdeploy/pull/973#issuecomment-1899662058), and in the end it is still implemented as ratio * free (https://github.com/InternLM/lmdeploy/pull/973#issuecomment-1907553394).
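
Conceptually, the ratio * free policy comes down to something like this (a simplified sketch, not TurboMind's actual code):

```python
import torch

def kv_cache_budget_bytes(ratio: float = 0.8) -> int:
    # Free VRAM is measured after the model weights have been loaded, so the
    # k/v cache gets ratio * (whatever is left) on the current device.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return int(ratio * free_bytes)
```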

I think a new parameter should be added with a more appropriate name.

In fact, the parameter can be either a ratio or an absolute block count:

https://github.com/InternLM/lmdeploy/blob/735d9a3742f4f52120dc1fd7b8081086e21e8224/src/turbomind/models/llama/BlockManager.cc#L30-L39
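
An illustrative Python rendering of that ratio-or-count behavior (the real logic is the C++ in BlockManager.cc linked above; the function name here is made up):

```python
def resolve_block_count(value: float, free_blocks: int) -> int:
    if 0.0 < value < 1.0:
        # fractional value: interpret as a ratio of the free blocks
        return int(value * free_blocks)
    # otherwise: interpret as an absolute number of blocks
    return int(value)
```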

josephrocca commented 1 month ago

Thank you very much for the details!

I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache?

The prefix cache is a reuse of the existing KV cache blocks, so there is no need to do this.

I'm not sure what you mean by "there is no need" - wouldn't it be better to have more cache storage space if VRAM is limited?

For example: currently, with two RTX 4090s running Llama 2 70B, I can store about 32 prefixes in the free VRAM, since the 70B model consumes almost all of the 48 GB of VRAM. This means there's a high probability of cache eviction if there are, e.g., 100 people using the service.

So if it were possible to use system/CPU RAM too, that would be awesome: I have >100 GB of system RAM available, so the cache eviction probability for each request would be much lower.
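
For context, my rough back-of-envelope for the k/v cache footprint (assuming Llama 2 70B's published architecture of 80 layers, 8 KV heads via GQA, head_dim 128, and an fp16 cache; the free-VRAM figure is just illustrative):

```python
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2        # Llama 2 70B, fp16 k/v
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(kv_bytes_per_token // 1024)                 # ~320 KiB per cached token
free_vram = 4 * 1024**3                           # e.g. ~4 GB left after weights
print(free_vram // kv_bytes_per_token)            # ~13k cacheable tokens in total
```

So even a few hundred tokens per prefix fills the free VRAM quickly, which is why spilling to system RAM seemed attractive to me.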

zhyncs commented 1 month ago

The overhead from using CPU offloading outweighs the benefits. None of the mainstream frameworks have successfully implemented high-performance and effective CPU offloading. It is a low priority at the moment.

josephrocca commented 1 month ago

Ah, I see. Thanks for explaining! I'll close this issue now.