NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Create INT8 KV Cache on Qserve #2446


dleunji commented 1 week ago

Hi,

Thanks for your contributions and updates to Qserve.

I added an INT8 KV cache feature as well.

Previously, the scale factor was calculated using the maximum value among the outputs of q_proj, k_proj, and v_proj. (code)
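For context, a minimal sketch of that previous behavior, assuming hypothetical calibration tensors `q_out`, `k_out`, and `v_out` and a symmetric INT8 range of [-127, 127]; the actual conversion logic is in the file linked above:

```python
import torch

def kv_cache_scale_from_qkv(q_out: torch.Tensor,
                            k_out: torch.Tensor,
                            v_out: torch.Tensor) -> torch.Tensor:
    """Sketch of the previous behavior: one scaling factor taken from the
    maximum absolute value across the q_proj, k_proj, and v_proj outputs."""
    max_abs = torch.stack([t.abs().max() for t in (q_out, k_out, v_out)]).max()
    # Map the observed activation range onto the symmetric INT8 range [-127, 127].
    return max_abs / 127.0
```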

However, I found that this approach does not work with Qserve.

It only works well with Qserve when the scale factor is calculated solely from the outputs of k_proj and v_proj. This differs from the INT8 KV cache in the Qserve paper, which uses a dynamic cache, but this static INT8 KV cache is a sufficient alternative for Qserve and maintains high accuracy.

[Reference] Qserve computes the KV cache scales separately for k and v, giving each its own scale. (code) TensorRT-LLM, in contrast, merges the k and v scales into a single kv_cache_scaling_factor derived from the outputs of qkv_proj, which makes it difficult to reproduce Qserve's KV cache scaling style in TensorRT-LLM. I therefore modified the approach to compute the KV cache scale without considering q_proj, bringing it closer to Qserve, and this produced much higher-quality outputs.
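Below is a rough sketch of the difference, using the same hypothetical calibration tensors as above; it is not the actual Qserve or TensorRT-LLM code, just an illustration of the two scaling styles:

```python
import torch

def qserve_style_scales(k_out: torch.Tensor, v_out: torch.Tensor):
    """Qserve style (sketch): k and v each keep their own scaling factor."""
    k_scale = k_out.abs().max() / 127.0
    v_scale = v_out.abs().max() / 127.0
    return k_scale, v_scale

def merged_kv_cache_scaling_factor(k_out: torch.Tensor,
                                   v_out: torch.Tensor) -> torch.Tensor:
    """Modified single factor (sketch): max over k and v only, ignoring q_proj."""
    return torch.maximum(k_out.abs().max(), v_out.abs().max()) / 127.0
```

The idea is that the single shared scale only has to cover k and v, so an unusually large q_proj activation no longer inflates the scale and wastes INT8 precision on the cached k/v values.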

lkm2835 commented 1 week ago

This is related to #2444

bobboli commented 4 days ago

Hi, thank you for your contribution! The current checkpoint conversion is implemented in the legacy path, and we plan to migrate to the unified converter in the future. After that, we can handle the combination of KV cache quantization with w4a8 in a more unified way.

Since you heavily modified load_weights_from_lmquant, which will be deprecated, we will not proceed with this PR. But we will take note of your observation about not using q_proj for calibration.

Thank you!