[QUESTION] Questions about what hit rate means and how it's calculated

bytedance / ShadowKV

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

https://bytedance.github.io/ShadowKV/

Apache License 2.0

[QUESTION] Questions about what hit rate means and how it's calculated #3

Open DoubleEspresso-7 opened 2 days ago

DoubleEspresso-7 commented 2 days ago

Hello, this work is amazing, but I have some questions about what hit rate means and how it's calculated.

In Figure 5, after the IDs of the top-k chunks are found, the missed KV cache is fetched for any chunk that misses. However, I could not find any further information in Algorithm 2 or elsewhere in the paper that explains what a hit or a miss means.

So I would like to know what hit and miss mean and how the hit rate is calculated.

Looking forward to your answers.

Thank you!

preminstrel commented 2 days ago

Thank you for your interest in ShadowKV.

The hit rate of the KV cache measures how much the KV pairs selected by the queries of two adjacent decoding steps overlap. We observe that, during decoding, this repetition rate is approximately 60%.

This insight allows us to optimize decoding by bypassing low-rank reconstruction and CPU data fetching for the repeated portions, focusing only on the non-repeated segments. As a result, this significantly reduces the overall decoding overhead.
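To make the definition concrete, here is a minimal sketch of how such a repetition rate could be computed from the chunk IDs selected at adjacent decoding steps. The function names and the example IDs are illustrative assumptions and do not come from the ShadowKV code base; the real implementation operates on GPU tensors per head.

```python
def hit_rate(prev_ids, curr_ids):
    """Fraction of chunk IDs selected at the current step that were
    also selected at the previous step (hypothetical helper)."""
    if not curr_ids:
        return 0.0
    prev = set(prev_ids)
    hits = sum(1 for cid in curr_ids if cid in prev)
    return hits / len(curr_ids)


def average_hit_rate(selections):
    """Mean hit rate over all pairs of adjacent decoding steps."""
    rates = [hit_rate(p, c) for p, c in zip(selections, selections[1:])]
    return sum(rates) / len(rates) if rates else 0.0


# Made-up top-k chunk IDs for three consecutive decoding steps.
steps = [
    [3, 7, 12, 20, 41],
    [3, 7, 12, 25, 40],  # 3 of 5 IDs repeat from the previous step
    [3, 8, 12, 25, 40],  # 4 of 5 IDs repeat from the previous step
]
```

Only the IDs that do not repeat (2 of 5, then 1 of 5 in this made-up example) would need reconstruction and CPU fetching.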

DoubleEspresso-7 commented 1 day ago

Thank you for your kind answers!

Here's my understanding of the hit rate: during decoding, we keep the current KV cache in HBM, which contains K_outlier, V_outlier, and the previous KV cache. So, at the next decoding step, if a selected top-k chunk ID is already in the current KV cache, we call it a hit.

I hope you can answer whether my understanding is correct or not.

Also, does the current KV cache contain all of the previously selected KV pairs or just those from the last step? Or is there a limit on the size of the current KV cache?

Looking forward to your answers.

Thank you!

preminstrel commented 1 day ago

Hello, your understanding is correct.

Our cache implementation is actually quite straightforward: we avoid complex mechanisms like LRU, so there is no additional memory consumption. Specifically, during each decoding step, we overwrite the KV pairs from the previous step using the same buffer, effectively reusing memory without allocating extra space.

If more complex cache rules are used, the hit rate can be improved, but it will bring some additional memory overhead.
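As a rough illustration of the single-buffer behavior described above, the sketch below keeps a fixed number of slots, leaves chunks that are selected again in place, and overwrites the remaining slots with newly fetched chunks. It is a simplified model under stated assumptions, not the repository's implementation: the class name, the `fetch_chunk` callback (standing in for low-rank reconstruction plus CPU fetching), and the requirement that each step selects exactly `num_slots` distinct chunk IDs are all hypothetical.

```python
class SingleStepChunkCache:
    """Fixed-size buffer that only remembers the previous step's chunks."""

    def __init__(self, num_slots, fetch_chunk):
        self.ids = [None] * num_slots    # chunk ID held by each slot
        self.data = [None] * num_slots   # chunk payload held by each slot
        self.fetch_chunk = fetch_chunk   # hypothetical miss handler
        self.hits = 0
        self.lookups = 0

    def step(self, selected_ids):
        # Assumes exactly num_slots distinct chunk IDs per step.
        assert len(set(selected_ids)) == len(self.ids)
        wanted = set(selected_ids)
        resident = set(self.ids)
        # Slots whose chunk was not selected again can be overwritten.
        free_slots = [i for i, cid in enumerate(self.ids) if cid not in wanted]
        for cid in selected_ids:
            self.lookups += 1
            if cid in resident:
                self.hits += 1           # hit: reuse the resident chunk
            else:
                slot = free_slots.pop()  # miss: overwrite a stale slot
                self.ids[slot] = cid
                self.data[slot] = self.fetch_chunk(cid)


fetched = []  # records which chunk IDs had to be fetched


def fake_fetch(cid):
    fetched.append(cid)
    return f"kv-chunk-{cid}"


cache = SingleStepChunkCache(5, fake_fetch)
cache.step([3, 7, 12, 20, 41])  # cold start: every chunk is a miss
cache.step([3, 7, 12, 25, 40])  # only chunks 25 and 40 are fetched
```

Because the buffer never grows, a chunk that was evicted two steps ago always counts as a miss. A policy such as LRU would need extra slots to retain such chunks, which is the memory trade-off mentioned above.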