Closed: josephrocca closed this issue 1 month ago
Also, as an aside, the docs say to use `--cache-max-entry-count` "to adjust the GPU mem ratio for k/v cache etc.", but this is a bit confusing and lacking in what I think are important details. TGI has `--cuda-memory-fraction` and vLLM has `--gpu-memory-utilization`, and in both cases the engine itself determines how much to allocate to the k/v cache based on how much VRAM is available - this is simpler and less confusing. I'm not sure whether this approach is compatible with lmdeploy's paradigm; I'm just mentioning it as a point of comparison.
I eventually found this page: https://lmdeploy.readthedocs.io/en/v0.4.2/api/pipeline.html#turbomindengineconfig but it was a bit hard to find, and I think it should be linked from the above-mentioned point in the docs.
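For anyone else landing here, setting the ratio programmatically looks roughly like this. This is a sketch based on the `TurbomindEngineConfig` API page linked above; the model name is just a placeholder:

```python
# Sketch based on the TurbomindEngineConfig API docs; the model path is a
# placeholder - substitute any model you actually have.
from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is the fraction of FREE GPU memory (i.e. what is
# left after the weights are loaded) reserved for the k/v cache; default 0.8.
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)
```

(Requires a GPU and a downloaded model, so treat it as a config fragment rather than something to copy-paste verbatim.)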
> For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache
I think a new parameter with a more appropriate name should be added. I'm also wondering whether it is safe to set this value to 1.0, or is the default of 0.8 there to prevent CUDA OOM errors? Or is it for testing on desktop (non-server) machines, where some VRAM must be left for OS/desktop rendering? If so, is it safe to set it to 1.0 on servers?
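To make the difference between the two semantics concrete, here is a back-of-envelope sketch (all numbers are made up for illustration): lmdeploy's ratio applies to *free* memory after the weights are loaded, while TGI/vLLM's fraction applies to *total* memory.

```python
# Toy numbers (assumptions, not measurements): a 24 GiB GPU where the model
# weights already occupy 16 GiB, leaving 8 GiB free.
GIB = 1024**3
total = 24 * GIB
weights = 16 * GIB
free = total - weights

# lmdeploy (> v0.2.1): cache_max_entry_count is a ratio of FREE memory.
lmdeploy_kv = 0.8 * free          # 6.4 GiB for the k/v cache

# vLLM: gpu-memory-utilization is a ratio of TOTAL memory; the k/v cache
# gets whatever remains under that cap after the weights (activation
# overhead is ignored here for simplicity).
vllm_budget = 0.9 * total
vllm_kv = vllm_budget - weights   # 5.6 GiB

print(lmdeploy_kv / GIB, vllm_kv / GIB)
```

Both knobs end up bounding the k/v cache; they just take the ratio against a different base.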
You may refer to the design and implementation of the prefix cache:

- design: https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2044203407
- implementation: https://github.com/InternLM/lmdeploy/pull/1450
The implementation is consistent with the overall design, and subsequent modifications were made based on the review comments in the PR.
I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache?
The prefix cache reuses the existing KV cache blocks; there is no need to do this.
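Conceptually, the point is that a prefix cache is just an index over KV blocks the engine has already allocated, so enabling it costs no extra VRAM beyond the existing block pool. A toy sketch of that idea (stand-in data structures, not lmdeploy's actual implementation):

```python
# Toy sketch of prefix caching over a fixed, pre-allocated pool of KV cache
# blocks. Hashing token tuples is a stand-in for real KV-tensor bookkeeping.
class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pre-allocated block pool
        self.prefix_index = {}               # prefix hash -> block id
        self.refcount = {}

    def get_block(self, prefix_tokens):
        """Return a block for this prefix, reusing a cached one if present."""
        key = hash(tuple(prefix_tokens))
        if key in self.prefix_index:         # cache hit: reuse, no new VRAM
            bid = self.prefix_index[key]
            self.refcount[bid] += 1
            return bid, True
        bid = self.free.pop()                # cache miss: take from the pool
        self.prefix_index[key] = bid
        self.refcount[bid] = 1
        return bid, False

pool = BlockPool(num_blocks=8)
b1, hit1 = pool.get_block([1, 2, 3, 4])  # miss: allocated from the pool
b2, hit2 = pool.get_block([1, 2, 3, 4])  # hit: the same block is reused
print(b1 == b2, hit1, hit2)              # True False True
```

The second request shares the first request's block instead of allocating a new one, which is why the prefix cache needs no memory budget of its own.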
> the engine itself will determine how much to allocate to k/v cache, based on how much VRAM is available

Yep. vLLM uses `profile_run` to check whether the `gpu-memory-utilization` setting is OK.
I have mentioned using a similar approach before: https://github.com/InternLM/lmdeploy/pull/973#issuecomment-1899662058. In the end, it is still implemented as `ratio * free`: https://github.com/InternLM/lmdeploy/pull/973#issuecomment-1907553394
> I think a new parameter should be added with a more appropriate name.

In fact, it can be a ratio or a count.
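One common convention for a single knob that accepts both forms (a sketch for illustration; the function name and exact rule are assumptions, not lmdeploy's actual parsing): a float in (0, 1) is treated as a fraction of free memory, and a value >= 1 as an absolute block count.

```python
def kv_cache_blocks(value, free_bytes, block_bytes):
    """Interpret one knob as either a ratio or an absolute count.

    Convention (assumed for illustration): values in (0, 1) are a fraction
    of currently free memory; values >= 1 are an absolute block count.
    """
    if 0 < value < 1:
        return int(value * free_bytes // block_bytes)
    if value >= 1:
        return int(value)
    raise ValueError("value must be positive")

# 0.8 of 8 GiB free with 2 MiB blocks -> 3276 blocks; 4096 -> exactly 4096.
print(kv_cache_blocks(0.8, 8 * 1024**3, 2 * 1024**2),
      kv_cache_blocks(4096, 0, 2 * 1024**2))
```

A count is useful for pinning the cache size exactly across heterogeneous GPUs, while a ratio adapts automatically to whatever memory is free.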
Thank you very much for the details!
> I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache?
>
> Prefix Cache is a reuse of the existing KV cache blocks, there is no need to do this.
I'm not sure what you mean by "there is no need" - wouldn't it be better to have more cache storage space if VRAM is limited?
For example: currently, with two RTX 4090s running Llama 2 70B, I can store about 32 prefixes in the free VRAM, since the 70B model consumes almost all of the 48 GB of VRAM. This means there's a high probability of cache eviction if there are e.g. 100 people using the service.
So if it were possible to use the system/cpu RAM too, that would be awesome, because I have >100 GB of system RAM available, so there would be a lower cache eviction probability for each request.
The overhead from using CPU offloading outweighs the benefits. None of the mainstream frameworks have successfully implemented high-performance and effective CPU offloading. It is a low priority at the moment.
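The underlying trade-off (stream cached KV over PCIe vs. just recompute the prefix) can be sketched with rough numbers. All figures below are assumptions for illustration; real offloading also pays for pinned-memory management, synchronization, and contention with other PCIe traffic, which is the overhead referred to above:

```python
GIB = 1024**3

# Assumed numbers for illustration only.
kv_bytes = 2 * GIB            # KV cache size for a longish prefix
pcie_bw = 16 * GIB            # rough PCIe 4.0 x16 effective bandwidth, B/s
prefill_tok_s = 2000          # assumed GPU prefill throughput, tokens/s
prefix_tokens = 2000          # length of the cached prefix

transfer_s = kv_bytes / pcie_bw              # stream KV host -> device
recompute_s = prefix_tokens / prefill_tok_s  # just recompute the prefix

# Raw transfer time can look competitive on paper; the practical overheads
# (paging, synchronization, PCIe contention) are what erode the benefit.
print(round(transfer_s, 3), round(recompute_s, 3))
```

Whether offloading pays off depends on where these two times land for a given model and hardware, plus those hidden overheads, which is why no mainstream framework has shipped a high-performance version yet.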
Ah, I see. Thanks for explaining! I'll close this issue now.
I'm guessing the prefix cache is stored in GPU VRAM. I'm wondering whether it's possible to allocate a percentage of system RAM to store the prefix cache? Or would that generally be too slow, i.e. faster to just recompute the data rather than waiting for it to stream from CPU to GPU?