Is the KV-cache actually unused in all LLM-evaluation tasks? Those tasks usually take only a single forward pass of attention, unlike autoregressive generation, which relies heavily on the KV-cache because tokens are generated one by one.
If so, how can we evaluate quantization performance when the KV-cache needs to be quantized, e.g. when accelerating an LLM with something like GPTQ? (The KV-cache is not exercised by the usual evaluation tasks.)
In the OmniQuant code, there seems to be no flag controlling KV-cache quantization (only quantization of the K and V projection matrices, not of the cache itself).
I would like to discuss this with the OmniQuant authors.
For question 1, the answer is yes: KV-cache quantization only matters when generating more than one token.
For question 2, OmniQuant applies only fake quantization for weight-activation quantization and does not obtain actual acceleration. You can refer to some recent works for more information, such as StreamingLLM (which focuses on KV-cache compression) and QUIK (which focuses on actual speedup for 4-bit weight-activation quantization).
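To make the distinction concrete, here is a minimal NumPy sketch (not OmniQuant code; all names are illustrative). A perplexity-style evaluation runs one forward pass over the whole sequence, so keys and values are computed once and never read back; during generation, each new token's query attends over a KV-cache that grows step by step:

```python
import numpy as np

def attention(q, k, v):
    # scaled dot-product attention of queries q over all cached positions
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 8
rng = np.random.default_rng(0)

# --- Evaluation (e.g. perplexity): one forward pass over the full prompt.
# K and V are computed once in that pass; nothing is reused afterwards.
prompt = rng.standard_normal((5, d))
out_eval = attention(prompt, prompt, prompt)  # single shot, no cache reuse

# --- Generation: tokens come one at a time, and each step's query
# attends over the KV-cache accumulated from all previous steps.
k_cache, v_cache = prompt.copy(), prompt.copy()
for _ in range(3):
    q = rng.standard_normal((1, d))       # query for the newest token
    out = attention(q, k_cache, v_cache)  # reads the growing cache
    k_cache = np.vstack([k_cache, q])     # append this step's K
    v_cache = np.vstack([v_cache, q])     # append this step's V

print(out_eval.shape, k_cache.shape)
```

Any error introduced by quantizing the cache only shows up in the second loop, which standard evaluation harnesses never execute.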
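"Fake quantization" here means quantizing a tensor and immediately dequantizing it, so the accuracy impact can be measured while all computation still runs in floating point (hence no speedup). A minimal sketch of symmetric per-tensor fake quantization applied to a KV tensor, assuming a hypothetical `fake_quant` helper and 4-bit precision:

```python
import numpy as np

def fake_quant(x, n_bits=4):
    # symmetric per-tensor fake quantization: round to an integer grid,
    # then scale back to float so downstream ops are unchanged
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((16, 64)).astype(np.float32)  # stand-in KV tensor
kv_q = fake_quant(kv, n_bits=4)

# The quality degradation is measurable on the dequantized tensor,
# even though no integer kernels (and hence no acceleration) are involved.
err = np.abs(kv - kv_q).mean()
```

Real speedup requires integer kernels that consume the quantized representation directly, which is what works like QUIK target.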