NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

How to choose different alpha for mtq.INT8_SMOOTHQUANT_CFG? #28

Closed: siahuat0727 closed this issue 5 months ago

siahuat0727 commented 5 months ago

Hi, I wonder if it's possible to choose a different alpha for mtq.INT8_SMOOTHQUANT_CFG?

siahuat0727 commented 5 months ago

I found an example here and it works: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/06a155388edd4a240051176a67a11886b15db082/llm_ptq/hf_ptq.py#L150

However, I noticed that setting alpha != 1 in SmoothQuant leads to different scales for the qkv and some other linear layers, which seems to prevent fusion with the preceding norm layer. Shouldn't these layers share the same smoothing scale for proper fusion?

Is this a bug or am I misunderstanding something?

Thanks!
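
For reference, the override in the linked script boils down to something like the following sketch (the exact config keys may differ across modelopt versions, and the helper name and alpha=0.8 are just illustrative):

```python
import copy

import modelopt.torch.quantization as mtq

def smoothquant_with_alpha(model, calibrate_loop, alpha: float = 0.8):
    # Copy the predefined SmoothQuant config and override alpha, as the
    # linked hf_ptq.py does. alpha=0.8 is an arbitrary illustration.
    quant_cfg = copy.deepcopy(mtq.INT8_SMOOTHQUANT_CFG)
    quant_cfg["algorithm"] = {"method": "smoothquant", "alpha": alpha}
    # calibrate_loop should run forward passes over calibration data.
    return mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
```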

RalphMao commented 5 months ago

With alpha != 1, the qkv projections end up with different pre-quant scaling factors, and we run a postprocessing step to resmooth them, so this is not a bug. The same thing also happens with AWQ.
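
For readers following along, here is a conceptual sketch of what such a resmoothing postprocess does (this is not modelopt's internal code, and the shared-scale choice below is an illustrative assumption):

```python
import torch

def resmooth_shared_input(smoothed_weights, scales):
    # SmoothQuant folds a per-input-channel scale s into each linear layer:
    #   y = (x / s) @ (W * s).T
    # With alpha != 1, q/k/v each calibrate their own s_i even though they
    # share the same input x, so the division by s_i cannot be fused into
    # the preceding norm. Picking one shared scale s (elementwise mean here,
    # purely as an illustration) and refolding the ratio into each weight,
    #   W_new_i = (W_i * s_i) * (s / s_i) = W_i * s,
    # keeps (x / s) @ W_new_i.T == x @ W_i.T exactly.
    shared = torch.stack(scales).mean(dim=0)  # shape: (in_features,)
    resmoothed = [w * (shared / s) for w, s in zip(smoothed_weights, scales)]
    return resmoothed, shared
```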

siahuat0727 commented 5 months ago

Thanks! That clears things up regarding the rescaling for alpha != 1. Does modelopt handle the rescaling internally? Ideally, I'd love to see an example of how to grab those resmoothed rescaling factors. @RalphMao

realAsma commented 5 months ago

@siahuat0727 modelopt handles the rescaling internally during TensorRT-LLM checkpoint export.

There are no public examples that showcase this.
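
For completeness, the export step where this happens internally looks roughly like the sketch below (argument values such as decoder_type and export_dir are illustrative and may differ across modelopt versions):

```python
import torch
from modelopt.torch.export import export_tensorrt_llm_checkpoint

def export_for_trtllm(model, export_dir="trtllm_ckpt"):
    # `model` is the calibrated model returned by mtq.quantize(); the
    # resmoothed pre-quant scales are computed and written out as part of
    # this export rather than exposed through a separate API.
    with torch.inference_mode():
        export_tensorrt_llm_checkpoint(
            model,
            decoder_type="llama",  # assumption: a llama-family decoder
            dtype=torch.float16,
            export_dir=export_dir,
        )
```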