InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] Prefix cache hit/miss/eviction statistics to detect cache thrashing #1942

Open josephrocca opened 3 months ago

josephrocca commented 3 months ago

Motivation

I'm trying to optimize a production scenario where I need to fit a 70B parameter model within 48 GB of VRAM, and after the model weights there is only enough room for about 20 chat threads in the prefix cache. I'm trying to work out how much of a bottleneck this is in "the real world", i.e. with actual traffic, since there are a lot of factors that are difficult to test: different prompt lengths, bursts of requests, varying amounts of overlap between chat threads, etc.

It would be great if there were some way to count evictions so it would be clear if the limited VRAM for prefix cache is causing a bottleneck.

(Even better would be if the cache manager kept hashes of recently evicted content, so it could tell whether a recently evicted prefix was requested again and thereby automatically detect "cache thrashing".)
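Purely to illustrate the kind of bookkeeping I mean (this is not LMDeploy code; the class and method names are made up), the cache manager could keep counters plus a bounded LRU set of recently evicted block hashes, and count a miss as "thrashing" when the missed hash was evicted not long ago:

```python
from collections import OrderedDict


class PrefixCacheStats:
    """Hypothetical hit/miss/eviction bookkeeping for a prefix cache.

    Remembers hashes of recently evicted blocks so that a miss on a hash
    we evicted a short while ago can be counted as a "thrash" miss.
    """

    def __init__(self, recent_evictions_to_track: int = 4096):
        self.hits = 0
        self.misses = 0
        self.evictions = 0
        self.thrash_misses = 0  # misses whose hash was recently evicted
        self._recently_evicted = OrderedDict()  # block_hash -> None, LRU order
        self._capacity = recent_evictions_to_track

    def record_lookup(self, block_hash: int, found: bool) -> None:
        if found:
            self.hits += 1
            return
        self.misses += 1
        if block_hash in self._recently_evicted:
            self.thrash_misses += 1

    def record_eviction(self, block_hash: int) -> None:
        self.evictions += 1
        self._recently_evicted[block_hash] = None
        self._recently_evicted.move_to_end(block_hash)
        if len(self._recently_evicted) > self._capacity:
            self._recently_evicted.popitem(last=False)

    def snapshot(self) -> dict:
        total = self.hits + self.misses
        return {
            "hits": self.hits,
            "misses": self.misses,
            "evictions": self.evictions,
            "thrash_misses": self.thrash_misses,
            "hit_rate": self.hits / total if total else 0.0,
        }
```

A high `thrash_misses / misses` ratio would then be a direct signal that the prefix cache is too small for the traffic pattern.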

Related resources

No response

Additional context

No response

josephrocca commented 3 months ago

Probably the easiest/simplest feature that would solve this for me would be some way to know whether a particular request hit or missed the prefix cache. E.g. the first or last EventStream JSON object could include an integer for how many tokens were pulled from the prefix cache, or something like that.
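Just to sketch the shape this could take (the `prefix_cache_hit_tokens` and `input_tokens` field names are hypothetical, not existing LMDeploy fields), the final streamed chunk could carry the count and a client could log the per-request hit ratio:

```python
import json

# Hypothetical final EventStream chunk; the extra fields are made up
# to illustrate the proposal, not an existing LMDeploy API.
last_chunk = (
    '{"text": "", "finish_reason": "stop", '
    '"prefix_cache_hit_tokens": 187, "input_tokens": 412}'
)

data = json.loads(last_chunk)
hit = data.get("prefix_cache_hit_tokens", 0)
total = data.get("input_tokens", 0)
if total:
    print(f"prefix cache served {hit}/{total} prompt tokens ({hit / total:.0%})")
```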

zhyncs commented 3 months ago

Something similar can be done. However, LMDeploy currently has no monitoring infrastructure, unlike the open-source implementation at https://github.com/vectorch-ai/ScaleLLM/tree/main/monitoring, so if we want to support this, we need to design the overall functionality first.

zhyncs commented 3 months ago

Based on my previous experience, the implementation of monitoring usually falls into the following categories:

  1. In large companies, there is usually a relatively complete data collection and monitoring display system, similar to CAT. In this case, we only need to integrate the corresponding SDK and use it according to the documentation.
  2. In open-source software, such as the aforementioned ScaleLLM implementation, Prometheus and Grafana are used, which is also a common practice compatible with open-source toolchains.
  3. The third type is based on my work experience at Baidu. For C++ projects, we can implement it by inheriting bvar from bRPC. Because TurboMind is loaded into LMDeploy as a shared library (.so) from Python, this approach may cause some trouble and requires proof-of-concept verification.

Once the monitoring function is in place, we can conveniently compute statistics such as FTL (first-token latency) and PTL (per-token latency), as well as the prefix-cache data you mentioned.
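As a rough sketch of option 2 above (assuming the `prometheus_client` Python package; the metric names and hook functions are made up, nothing here exists in LMDeploy today), the counters and an exporter endpoint for Grafana to read could look like this:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names for illustration only.
PREFIX_CACHE_HIT_TOKENS = Counter(
    "lmdeploy_prefix_cache_hit_tokens_total",
    "Prompt tokens served from the prefix cache")
PREFIX_CACHE_MISS_TOKENS = Counter(
    "lmdeploy_prefix_cache_miss_tokens_total",
    "Prompt tokens that had to be recomputed")
PREFIX_CACHE_EVICTIONS = Counter(
    "lmdeploy_prefix_cache_evictions_total",
    "Prefix cache blocks evicted")
FIRST_TOKEN_LATENCY = Histogram(
    "lmdeploy_first_token_latency_seconds",
    "Time from request arrival to first generated token")


def on_request_scheduled(cached_tokens: int, uncached_tokens: int) -> None:
    # Would be called by the scheduler once prefix matching has been done.
    PREFIX_CACHE_HIT_TOKENS.inc(cached_tokens)
    PREFIX_CACHE_MISS_TOKENS.inc(uncached_tokens)


def on_block_evicted() -> None:
    # Would be called by the block manager when a cached block is evicted.
    PREFIX_CACHE_EVICTIONS.inc()


if __name__ == "__main__":
    # Expose metrics at http://localhost:8000/metrics for Prometheus to scrape.
    start_http_server(8000)
```

Prometheus would scrape the endpoint periodically, and the hit rate, eviction rate, and latency percentiles could then be plotted in Grafana.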

cc @lzhangzz @lvhan028

zhyncs commented 2 months ago

Since @lzhangzz has higher-priority matters to attend to (see https://github.com/InternLM/lmdeploy/issues/1970#issuecomment-2217138900 for details), I will follow up on this issue for now.

zhyncs commented 2 months ago

cc @ispobock