NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

AWQ performs worse than llm-awq #652

Open spongezz opened 11 months ago

spongezz commented 11 months ago

If I am not mistaken, the AWQ implemented in ammo uses a default alpha_step = 0.1 to search the scaling parameter. However, the model quantized by ammo shows a larger accuracy drop than the one quantized by llm-awq. Is ammo open source, so that I can check whether I made a mistake?
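
For context, here is a minimal numpy sketch of the kind of per-layer scale search AWQ performs, using the same 0.1 step over the exponent as the alpha_step mentioned above. The helper names (`pseudo_quantize`, `search_awq_alpha`) are illustrative placeholders, not ammo or llm-awq APIs:

```python
import numpy as np

def pseudo_quantize(w, n_bits=4, group_size=128):
    """Group-wise symmetric round-to-nearest quantize-dequantize (illustrative only)."""
    qmax = 2 ** (n_bits - 1) - 1
    g = w.reshape(-1, group_size)                      # assumes in_features % group_size == 0
    scale = np.abs(g).max(axis=1, keepdims=True) / qmax + 1e-12
    q = np.clip(np.round(g / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)

def search_awq_alpha(w, x, alpha_step=0.1):
    """Grid-search the AWQ scaling exponent alpha over [0, 1] with the given step.

    w: [out_features, in_features] weights, x: [tokens, in_features] calibration activations.
    """
    act_stat = np.abs(x).mean(axis=0) + 1e-8           # per-input-channel activation magnitude
    ref = x @ w.T                                      # full-precision layer output
    best_alpha, best_err = 0.0, float("inf")
    for alpha in np.arange(0.0, 1.0 + 1e-9, alpha_step):
        s = act_stat ** alpha
        s = s / np.sqrt(s.max() * s.min())             # normalize scales as in the AWQ paper
        wq = pseudo_quantize(w * s)                    # scale weights up before quantizing
        err = float(np.mean(((x / s) @ wq.T - ref) ** 2))  # fold 1/s into the activations
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```

Searching alpha this way only covers the scaling step; clipping is a separate stage (see the later comments).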

RalphMao commented 11 months ago

Hi @spongezz , could you provide more details on the accuracy difference? There are several differences between our AWQ implementation and llm-awq.

  1. lm_head is quantized in ammo (this causes an accuracy drop on some models, so we disabled it in the most recent release last weekend)
  2. ammo uses symmetric quantization instead of the asymmetric quantization in llm-awq, which will cause slightly more accuracy drop (see the sketch below)
  3. llm-awq combines the AWQ scale with weight clipping, while ammo by default only runs the AWQ scale for fast quantization
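
To make difference 2 concrete, a small numpy comparison of symmetric vs. asymmetric 4-bit round-to-nearest quantization on a skewed weight group; this is a generic illustration, not the actual ammo or llm-awq kernels:

```python
import numpy as np

def quantize_group(w, n_bits=4, symmetric=True):
    """Round-to-nearest quantize one weight group and return the dequantized values."""
    if symmetric:
        # Symmetric: the grid is centered on zero, so skewed weight groups waste levels.
        qmax = 2 ** (n_bits - 1) - 1
        scale = np.abs(w).max() / qmax + 1e-12
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        return q * scale
    # Asymmetric: a zero-point shifts the grid to cover exactly [min, max].
    qmax = 2 ** n_bits - 1
    scale = (w.max() - w.min()) / qmax + 1e-12
    zero = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale

rng = np.random.default_rng(0)
group = rng.normal(loc=0.02, scale=0.01, size=128)     # a skewed (non-zero-mean) weight group
for sym in (True, False):
    mse = np.mean((quantize_group(group, symmetric=sym) - group) ** 2)
    print(f"symmetric={sym}: MSE={mse:.3e}")
```

On a non-zero-mean group the asymmetric grid covers the actual [min, max] range, so its MSE is lower, consistent with the slightly larger drop expected from symmetric quantization.
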
spongezz commented 11 months ago

@RalphMao Thank you for your reply! I will recheck and give you feedback soon.

Hukongtao commented 9 months ago

> Hi @spongezz , could you provide more details on the accuracy difference? There are several differences between our AWQ implementation and llm-awq.
>
>   1. lm_head is quantized in ammo (this causes an accuracy drop on some models, so we disabled it in the most recent release last weekend)
>   2. ammo uses symmetric quantization instead of the asymmetric quantization in llm-awq, which will cause slightly more accuracy drop
>   3. llm-awq combines the AWQ scale with weight clipping, while ammo by default only runs the AWQ scale for fast quantization

Same problem. There is a big difference between the AWQ score and the FP16 score. Are there any other parameters that can be adjusted?
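
On "other parameters": point 3 in the quoted reply says llm-awq additionally runs a clipping search that ammo skips by default. Below is a rough numpy sketch of what such a per-channel clipping search looks like; the helpers are illustrative placeholders, and this is not a claim about how (or whether) ammo exposes clipping:

```python
import numpy as np

def pseudo_quantize(w, n_bits=4, group_size=128):
    """Same illustrative group-wise symmetric quantize-dequantize as in the earlier sketch."""
    qmax = 2 ** (n_bits - 1) - 1
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / qmax + 1e-12
    q = np.clip(np.round(g / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)

def search_clip_ratio(w, x, n_steps=20):
    """Per-output-channel search for a max-value clipping ratio that minimizes output MSE.

    w: [out_features, in_features] weights, x: [tokens, in_features] calibration activations.
    Mirrors the idea of llm-awq's clipping stage, not its exact implementation.
    """
    ref = x @ w.T
    best_ratio = np.ones((w.shape[0],))
    best_err = np.full((w.shape[0],), np.inf)
    for i in range(n_steps):
        ratio = 1.0 - i / (2 * n_steps)                        # try ratios from 1.0 down to ~0.5
        clip_val = np.abs(w).max(axis=1, keepdims=True) * ratio
        wq = pseudo_quantize(np.clip(w, -clip_val, clip_val))
        err = np.mean((x @ wq.T - ref) ** 2, axis=0)           # per-output-channel MSE
        better = err < best_err
        best_ratio[better], best_err[better] = ratio, err[better]
    return best_ratio
```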

Hukongtao commented 9 months ago

> @RalphMao Thank you for your reply! I will recheck and give you feedback soon.

Is your problem solved?

spongezz commented 9 months ago

> > @RalphMao Thank you for your reply! I will recheck and give you feedback soon.
>
> Is your problem solved?

Not yet. I tried using full quantization, but it was too slow. I don't know whether it has become faster in the latest commit. I just use INT8 quantization to meet my needs. Would you please let me know if you solve the problem?
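
For reference, a toy version of weight-only INT8 (per-channel symmetric round-to-nearest, which is my understanding of what an INT8 weight-only scheme amounts to conceptually; this is not the actual TensorRT-LLM kernel). With 256 levels per channel the rounding error is much smaller than in the 4-bit sketches above, which is why it is often an acceptable fallback:

```python
import numpy as np

def int8_weight_only(w):
    """Per-channel symmetric INT8 quantization of a weight matrix (illustrative only)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                                    # dequantize as q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)
q, scale = int8_weight_only(w)
print("max abs error:", float(np.abs(q * scale - w).max()))
```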

Hukongtao commented 9 months ago

> > > @RalphMao Thank you for your reply! I will recheck and give you feedback soon.
> >
> > Is your problem solved?
>
> Not yet. I tried using full quantization, but it was too slow. I don't know whether it has become faster in the latest commit. I just use INT8 quantization to meet my needs. Would you please let me know if you solve the problem?

I now use GPTQ instead of AWQ, and the accuracy is acceptable: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md#gptq

hello-11 commented 1 week ago

@spongezz Do you still have the problem? If not, we will close it soon.