NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Create INT8 KV Cache on Qserve #2446


dleunji commented 1 week ago

Hi,

Thanks for your contributions and updates to Qserve.

I added an INT8 KV cache feature as well.

Previously, the scale factor was calculated using the maximum value among the outputs of q_proj, k_proj, and v_proj. (code)
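For context, a minimal sketch of that previous behavior, assuming hypothetical calibration tensors `q_out`, `k_out`, and `v_out` and a symmetric INT8 range of [-127, 127]; the actual conversion logic is in the file linked above:

```python
import torch

def kv_cache_scale_from_qkv(q_out: torch.Tensor,
                            k_out: torch.Tensor,
                            v_out: torch.Tensor) -> torch.Tensor:
    """Sketch of the previous behavior: one scaling factor taken from the
    maximum absolute value across the q_proj, k_proj, and v_proj outputs."""
    max_abs = torch.stack([t.abs().max() for t in (q_out, k_out, v_out)]).max()
    # Map the observed activation range onto the symmetric INT8 range [-127, 127].
    return max_abs / 127.0
```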

However, I found that this approach does not work with Qserve.

It only works well with Qserve when the scale factor is calculated solely from the outputs of k_proj and v_proj. This differs from the INT8 KV cache in the Qserve paper, which uses a dynamic cache, but this static INT8 KV cache is a sufficient alternative for Qserve and maintains high accuracy.

[Reference] Qserve computes the KV cache scales separately for k and v, giving each its own scale. (code) TensorRT-LLM, in contrast, merges the k and v scales into a single kv_cache_scaling_factor derived from the outputs of qkv_proj, which makes it difficult to reproduce Qserve's KV cache scaling style in TensorRT-LLM. I therefore modified the approach to compute the KV cache scale without considering q_proj, bringing it closer to Qserve, and this produced much higher-quality outputs.
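Below is a rough sketch of the difference, using the same hypothetical calibration tensors as above; it is not the actual Qserve or TensorRT-LLM code, just an illustration of the two scaling styles:

```python
import torch

def qserve_style_scales(k_out: torch.Tensor, v_out: torch.Tensor):
    """Qserve style (sketch): k and v each keep their own scaling factor."""
    k_scale = k_out.abs().max() / 127.0
    v_scale = v_out.abs().max() / 127.0
    return k_scale, v_scale

def merged_kv_cache_scaling_factor(k_out: torch.Tensor,
                                   v_out: torch.Tensor) -> torch.Tensor:
    """Modified single factor (sketch): max over k and v only, ignoring q_proj."""
    return torch.maximum(k_out.abs().max(), v_out.abs().max()) / 127.0
```

The idea is that the single shared scale only has to cover k and v, so an unusually large q_proj activation no longer inflates the scale and wastes INT8 precision on the cached k/v values.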

lkm2835 commented 1 week ago

This is related to #2444

bobboli commented 4 days ago

Hi, thank you for your contribution! The current checkpoint conversion is implemented in the legacy path, and we plan to migrate to the unified converter in the future. After that, we can handle the combination of KV cache quantization with w4a8 in a more unified way.

Since you heavily modified load_weights_from_lmquant, which will be deprecated, we will not proceed with this PR. But we will take note of your observation about not using q_proj for calibration.

Thank you!