Open josephrocca opened 3 months ago
Probably the easiest/simplest feature that would solve this for me would be some way to know whether a particular request hit or missed the prefix cache. E.g. the first or last EventStream JSON object could include an integer for how many tokens were pulled from the prefix cache, or something similar.
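To make the request concrete, a first streamed event carrying such a counter might look like the sketch below. The `cached_tokens` and `prompt_tokens` field names are hypothetical, not existing LMDeploy response keys:

```python
import json

# Hypothetical shape of the first EventStream chunk; "cached_tokens" is the
# proposed field (tokens served from the prefix cache), not a real API key.
first_event = json.loads('{"id": "0", "cached_tokens": 128, "prompt_tokens": 160}')

cache_hit = first_event["cached_tokens"] > 0
hit_ratio = first_event["cached_tokens"] / first_event["prompt_tokens"]
print(cache_hit, hit_ratio)  # True 0.8
```

A client could aggregate this field across requests to estimate the overall prefix-cache hit rate under real traffic.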
Something similar could be done. However, LMDeploy currently has no monitoring infrastructure (unlike, for example, the open-source implementation at https://github.com/vectorch-ai/ScaleLLM/tree/main/monitoring), so supporting this would require designing the overall functionality first.
Based on my previous experience, monitoring implementations usually fall into the following categories:
Once the monitoring functionality is in place, we can conveniently compute statistics such as FTL (first token latency) and PTL (per-token latency), as well as the prefix-cache data you mentioned.
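As a rough illustration of the kind of aggregation such a monitoring layer could do, here is a minimal stdlib-only sketch. The record fields (`cached_tokens`, `first_token_latency`, etc.) are illustrative names, not LMDeploy APIs:

```python
import statistics
from dataclasses import dataclass, field

# Hypothetical per-request record; field names are illustrative only.
@dataclass
class RequestStats:
    prompt_tokens: int
    cached_tokens: int          # prompt tokens served from the prefix cache
    first_token_latency: float  # seconds until first generated token (FTL)
    per_token_latency: float    # mean seconds per subsequent token (PTL)

@dataclass
class Monitor:
    records: list = field(default_factory=list)

    def observe(self, r: RequestStats) -> None:
        self.records.append(r)

    def summary(self) -> dict:
        hit = sum(r.cached_tokens for r in self.records)
        total = sum(r.prompt_tokens for r in self.records)
        return {
            "prefix_cache_hit_rate": hit / total if total else 0.0,
            "ftl_p50": statistics.median(r.first_token_latency for r in self.records),
            "ptl_p50": statistics.median(r.per_token_latency for r in self.records),
        }

mon = Monitor()
mon.observe(RequestStats(prompt_tokens=100, cached_tokens=80,
                         first_token_latency=0.12, per_token_latency=0.03))
mon.observe(RequestStats(prompt_tokens=200, cached_tokens=0,
                         first_token_latency=0.45, per_token_latency=0.03))
print(mon.summary())
```

In a real deployment these counters would more likely be exported as Prometheus-style metrics rather than kept in process memory.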
cc @lzhangzz @lvhan028
Since @lzhangzz has higher-priority matters to attend to, please refer to https://github.com/InternLM/lmdeploy/issues/1970#issuecomment-2217138900 for details. I will follow up on this issue for now.
cc @ispobock
Motivation
I'm trying to optimize a production scenario where I need to fit a 70B-parameter model within 48 GB of VRAM, and after the model weights there is only enough room for about 20 chat threads in the prefix cache. I'm trying to work out how much of a bottleneck this is in "the real world", i.e. with actual traffic, since there are many factors that are difficult to test: different prompt lengths, bursts of requests, varying amounts of overlap between different chat threads, etc.
It would be great if there were some way to count evictions so it would be clear if the limited VRAM for prefix cache is causing a bottleneck.
(Even better would be if the cache manager kept hashes of recently evicted content, so it could determine whether a recently evicted prefix was requested again and thereby automatically detect "cache thrashing".)
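The thrash-detection idea above can be sketched with a toy LRU cache that keeps a bounded history of evicted hashes. Everything here (class and field names, the block-hash granularity) is illustrative; it is not LMDeploy's actual cache manager:

```python
from collections import OrderedDict

class PrefixCacheSim:
    """Toy LRU prefix cache that remembers hashes of recently evicted blocks,
    so a later request for an evicted prefix can be counted as "thrash"."""

    def __init__(self, capacity: int, evicted_history: int = 1024):
        self.capacity = capacity
        self.history = evicted_history
        self.blocks = OrderedDict()            # block_hash -> payload, LRU order
        self.recently_evicted = OrderedDict()  # bounded set of evicted hashes
        self.evictions = 0
        self.thrash_hits = 0                   # misses that hit the evicted set

    def lookup(self, block_hash) -> bool:
        if block_hash in self.blocks:
            self.blocks.move_to_end(block_hash)  # LRU touch on hit
            return True
        if block_hash in self.recently_evicted:
            self.thrash_hits += 1  # evicted recently, then requested again
        return False

    def insert(self, block_hash, payload=None) -> None:
        self.blocks[block_hash] = payload
        self.blocks.move_to_end(block_hash)
        while len(self.blocks) > self.capacity:
            victim, _ = self.blocks.popitem(last=False)  # evict LRU block
            self.evictions += 1
            self.recently_evicted[victim] = None
            while len(self.recently_evicted) > self.history:
                self.recently_evicted.popitem(last=False)  # bound the history

cache = PrefixCacheSim(capacity=2)
for h in ("a", "b", "c"):  # inserting "c" evicts "a"
    cache.insert(h)
print(cache.lookup("a"), cache.evictions, cache.thrash_hits)  # False 1 1
```

A rising `thrash_hits` relative to `evictions` would indicate exactly the scenario described: the prefix-cache VRAM budget is too small for the working set.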
Related resources
No response
Additional context
No response