OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

general question about LLM kv-cache quantization #41

Closed: brisker closed this issue 6 months ago

brisker commented 7 months ago
  1. Is the kv-cache actually unused in all of the LLM evaluation tasks? Those tasks usually take only a single forward pass of attention, unlike the generation process, which relies heavily on the kv-cache because tokens are produced one by one. (See the sketch at the end of this comment.)

  2. If this is true, how should we evaluate quantization performance when the kv-cache needs to be quantized, e.g. if we want to accelerate an LLM with something like GPTQ? (Since the kv-cache is not exercised in the usual evaluation tasks.)

In the OmniQuant code, there seems to be no flag that controls kv-cache quantization (only quantization of the k and v projection outputs, not of the cache itself).
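
To make the distinction concrete, here is a minimal sketch of what I mean (not OmniQuant code; a single attention head, causal masking omitted, all names and shapes are illustrative assumptions): in evaluation the whole sequence goes through one forward pass and nothing is cached across calls, while in generation the cached K/V tensors are exactly what kv-cache quantization would compress.

```python
# Illustrative sketch only: one attention head, no causal mask.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64                                   # head dimension (assumed)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attn(q, k, v):
    scores = q @ k.transpose(-1, -2) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

# --- Evaluation (e.g. perplexity): one full-sequence forward pass. ---
x = torch.randn(16, d)                   # 16 tokens, processed at once
k_full, v_full = x @ Wk, x @ Wv          # "k and v matrices" of this pass
out_eval = attn(x @ Wq, k_full, v_full)  # nothing is reused across calls

# --- Generation: tokens arrive one by one, so K/V must be cached. ---
k_cache, v_cache = [], []                # this is the kv-cache
for t in range(16):
    x_t = torch.randn(1, d)              # stand-in for the newest token
    k_cache.append(x_t @ Wk)             # these cached tensors are what
    v_cache.append(x_t @ Wv)             # kv-cache quantization would store
    out_t = attn(x_t @ Wq, torch.cat(k_cache), torch.cat(v_cache))
```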

I hope to discuss this with the OmniQuant authors.

ChenMnZ commented 7 months ago
  1. For question 1, the answer is yes. KV-cache quantization only matters when more than one token is generated.
  2. For question 2, OmniQuant only performs fake quantization for weight-activation quantization and cannot obtain actual acceleration (see the sketch below). You can refer to some recent works for more information, such as streaming-llm (which focuses on kv-cache compression) and QUIK (which focuses on actual speedup of 4-bit weight-activation quantization).
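
As an illustration of point 2, here is a minimal sketch of per-tensor asymmetric fake quantization (an assumption for illustration, not the OmniQuant implementation): the tensor is rounded to a low-bit grid but stored back in floating point, so every matmul still runs as a full-precision GEMM and no speedup is obtained.

```python
# Illustrative sketch of fake (simulated) quantization, not OmniQuant code.
import torch

def fake_quant(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / qmax
    zero_point = torch.round(-x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale      # dequantized: still a float tensor

W = torch.randn(128, 128)                # weight (illustrative shapes)
A = torch.randn(4, 128)                  # activation
# The rounding error is simulated, but the matmul below is still an
# ordinary floating-point GEMM, so there is no actual acceleration.
out = fake_quant(A) @ fake_quant(W).t()
```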