Open dleunji opened 1 week ago
This is related to #2444
Hi, thank you for your contribution! The current checkpoint conversion is implemented in the legacy path, whereas we plan to migrate to the unified converter in the future. After that, we can handle the combination of KV cache quantization with w4a8 in a more unified way.
Since you modified `load_weights_from_lmquant` heavily, and that function will be deprecated, we will not proceed with this PR. But we will refer to your observation of not using `q_proj` for calibration.
Thank you!
Hi,
Thanks for your contributions and the updates to QServe.
I added an INT8 KV cache feature as well.
Previously, the scale factor was calculated using the maximum value among the outputs of `q_proj`, `k_proj`, and `v_proj` (code). However, I found that this does not work well with QServe.
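For illustration, here is a minimal sketch (not the actual TensorRT-LLM conversion code) of that previous derivation, assuming a hypothetical `act_ranges` dict of per-projection calibration maxima:

```python
import torch

def kv_cache_scale_from_qkv(act_ranges: dict) -> torch.Tensor:
    """Sketch of the previous behaviour: the INT8 KV cache scaling factor is
    taken from the largest calibrated output range among q_proj, k_proj and
    v_proj, so an outlier in q_proj inflates the scale applied to K/V."""
    max_abs = max(
        act_ranges["q_proj"],   # q_proj is never cached, but still enters the max
        act_ranges["k_proj"],
        act_ranges["v_proj"],
    )
    # Map the observed range onto the int8 representable range [-127, 127].
    return torch.tensor(max_abs / 127.0, dtype=torch.float32)

# Hypothetical per-layer calibration statistics (max |activation| values).
ranges = {"q_proj": 48.0, "k_proj": 6.5, "v_proj": 4.2}
print(kv_cache_scale_from_qkv(ranges))  # dominated by the q_proj range
```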
It only works well with QServe when the scale factor is calculated based solely on the outputs of `k_proj` and `v_proj`. This is different from the INT8 KV cache in the QServe paper, which uses a dynamic cache; still, this INT8 KV cache is a sufficient alternative for QServe with high accuracy.

[Reference] QServe stores the KV cache scales separately for K and V, treating each with its own scale (code). TensorRT-LLM, however, merges the K and V scales into a single `kv_cache_scaling_factor` derived from the outputs of `qkv_proj`. This setup made it difficult to use QServe's KV cache scaling style in TensorRT-LLM, so I modified the approach to obtain the KV cache scale without considering `q_proj`, making it closer to QServe. With this change I got much higher-quality outputs.
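Below is a corresponding sketch of the modified derivation, again using the hypothetical `act_ranges` dict from the example above: QServe-style scales are computed for K and V only and then merged into the single `kv_cache_scaling_factor` that TensorRT-LLM stores, so `q_proj` outliers no longer inflate the scale:

```python
import torch

def kv_cache_scale_without_q(act_ranges: dict) -> torch.Tensor:
    """Sketch of the modified behaviour: only the k_proj and v_proj outputs
    contribute to the merged INT8 KV cache scaling factor."""
    # QServe keeps one scale per tensor; here the two are merged into one value
    # because TensorRT-LLM stores a single kv_cache_scaling_factor per layer.
    k_scale = act_ranges["k_proj"] / 127.0
    v_scale = act_ranges["v_proj"] / 127.0
    return torch.tensor(max(k_scale, v_scale), dtype=torch.float32)

ranges = {"q_proj": 48.0, "k_proj": 6.5, "v_proj": 4.2}
print(kv_cache_scale_without_q(ranges))  # unaffected by the q_proj outlier
```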