Is the KV-cache actually unused in all LLM-evaluation tasks? Those tasks usually take only a single forward pass of attention, unlike autoregressive generation, which relies heavily on the KV-cache because tokens are generated one by one.
If so, how can we evaluate quantization performance when the KV-cache needs to be quantized, e.g. when accelerating an LLM with something like GPTQ? (The KV-cache is not exercised by the usual evaluation tasks.)
In the OmniQuant code, there seems to be no flag controlling KV-cache quantization (only quantization of the K and V projection matrices, not of the cache itself).
I would like to discuss this with the OmniQuant authors.
For question 1, the answer is yes: KV-cache quantization only matters when generating more than one token.
For question 2, OmniQuant applies only fake quantization for weight-activation quantization and does not obtain actual acceleration. You can refer to some recent works for more information, such as StreamingLLM (which focuses on KV-cache compression) and QUIK (which focuses on actual speedup for 4-bit weight-activation quantization).
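To make the distinction concrete, here is a minimal NumPy sketch (not OmniQuant code; all names are illustrative). A perplexity-style evaluation runs one forward pass over the whole sequence, so keys and values are computed once and never read back; during generation, each new token's query attends over a KV-cache that grows step by step:

```python
import numpy as np

def attention(q, k, v):
    # scaled dot-product attention of queries q over all cached positions
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 8
rng = np.random.default_rng(0)

# --- Evaluation (e.g. perplexity): one forward pass over the full prompt.
# K and V are computed once in that pass; nothing is reused afterwards.
prompt = rng.standard_normal((5, d))
out_eval = attention(prompt, prompt, prompt)  # single shot, no cache reuse

# --- Generation: tokens come one at a time, and each step's query
# attends over the KV-cache accumulated from all previous steps.
k_cache, v_cache = prompt.copy(), prompt.copy()
for _ in range(3):
    q = rng.standard_normal((1, d))       # query for the newest token
    out = attention(q, k_cache, v_cache)  # reads the growing cache
    k_cache = np.vstack([k_cache, q])     # append this step's K
    v_cache = np.vstack([v_cache, q])     # append this step's V

print(out_eval.shape, k_cache.shape)
```

Any error introduced by quantizing the cache only shows up in the second loop, which standard evaluation harnesses never execute.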
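"Fake quantization" here means quantizing a tensor and immediately dequantizing it, so the accuracy impact can be measured while all computation still runs in floating point (hence no speedup). A minimal sketch of symmetric per-tensor fake quantization applied to a KV tensor, assuming a hypothetical `fake_quant` helper and 4-bit precision:

```python
import numpy as np

def fake_quant(x, n_bits=4):
    # symmetric per-tensor fake quantization: round to an integer grid,
    # then scale back to float so downstream ops are unchanged
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((16, 64)).astype(np.float32)  # stand-in KV tensor
kv_q = fake_quant(kv, n_bits=4)

# The quality degradation is measurable on the dequantized tensor,
# even though no integer kernels (and hence no acceleration) are involved.
err = np.abs(kv - kv_q).mean()
```

Real speedup requires integer kernels that consume the quantized representation directly, which is what works like QUIK target.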