NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

How can I ignore some layers and prevent them from being quantized in AWQ quantization by configuring the config file? #33

Open: shaoyanguo opened this issue 1 week ago

shaoyanguo commented 1 week ago

During quantization of Llama-13B, I modified the config with `quant_cfg["quant_cfg"]["*self_attn*"] = {"enable": False}`. However, in the generated config file (shown below), `"exclude_modules"` still only has `"lm_head"`. What should I do?

```json
"quantization": {
    "quant_algo": "W4A16_AWQ",
    "kv_cache_quant_algo": null,
    "group_size": 0,
    "has_zero_point": false,
    "pre_quant_scale": true,
    "exclude_modules": [
        "lm_head"
    ]
},
```
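For reference, this is the kind of override being described, as a minimal sketch. It assumes the `modelopt.torch.quantization` API and its built-in `INT4_AWQ_CFG` recipe; the model and calibration-loop names mentioned in the comments are placeholders.

```python
import copy

import modelopt.torch.quantization as mtq

# Copy the built-in INT4 AWQ recipe so the library default is not mutated.
quant_cfg = copy.deepcopy(mtq.INT4_AWQ_CFG)

# Wildcard override intended to disable quantization for every module whose
# name matches "*self_attn*" (the pattern used in the question above).
quant_cfg["quant_cfg"]["*self_attn*"] = {"enable": False}

# The modified config would then be passed to mtq.quantize, e.g.:
#   model = mtq.quantize(model, quant_cfg, forward_loop=calibrate)
# where `model` and `calibrate` are placeholders for the user's model and
# calibration forward loop.
print(quant_cfg["quant_cfg"]["*self_attn*"])
```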

meenchen commented 4 days ago

Hi @shaoyanguo, thanks for raising this issue. We plan to address it in a future release. Since the current implementation hardcodes `exclude_modules`, you can manually update the generated config file as a workaround, e.g., by adding `transformer.layers.0.attention.qkv` to the list.
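A minimal sketch of that manual edit, assuming the quantized checkpoint config was exported to `quantized_ckpt/config.json` (the path is an assumption, and the module names must match your exported model):

```python
import json
from pathlib import Path

# Assumed location of the exported checkpoint config; adjust as needed.
config_path = Path("quantized_ckpt/config.json")
config = json.loads(config_path.read_text())

# Append the modules that should stay unquantized to "exclude_modules".
exclude = config["quantization"].setdefault("exclude_modules", [])
for name in ("transformer.layers.0.attention.qkv",):
    if name not in exclude:
        exclude.append(name)

config_path.write_text(json.dumps(config, indent=4))
```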