NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Llama-2 13B SmoothQuant W8A8 Per-Tensor TP-4 performance is poor in v0.9.0 release #1618

Open vnkc1 opened 2 months ago

vnkc1 commented 2 months ago

System Info

GPUs: 4x A100 (40 GB memory each)
Release: tensorrt-llm 0.9.0

Who can help?

@Tracin

Reproduction

  1. Install tensorrt-llm 0.9.0
  2. Create a Llama-2 13B chat, TP-4, SmoothQuant 0.5, Per-Tensor checkpoint and engine
  3. Create a Llama-2 13B chat, TP-4, SmoothQuant 0.5, Per-Channel + Per-Token checkpoint and engine
  4. Run mmlu.py on both engines (a command sketch for steps 2-4 follows this list)
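
For reference, a sketch of steps 2-4 as a small Python driver that shells out to the example scripts shipped with the repo. The script paths and flags follow the examples/llama workflow in v0.9.0 and may differ in other releases; the model path is a placeholder.

```python
# Hypothetical reproduction driver (paths and flags assumed from the
# examples/llama workflow in v0.9.0; adjust to your setup).
import subprocess

MODEL_DIR = "./Llama-2-13b-chat-hf"  # placeholder: local HF checkpoint


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Step 2: Per-Tensor SmoothQuant checkpoint (no --per_channel / --per_token).
run(["python", "examples/llama/convert_checkpoint.py",
     "--model_dir", MODEL_DIR,
     "--output_dir", "ckpt_sq_per_tensor",
     "--dtype", "float16",
     "--smoothquant", "0.5",
     "--tp_size", "4"])

# Step 3: Per-Channel + Per-Token SmoothQuant checkpoint.
run(["python", "examples/llama/convert_checkpoint.py",
     "--model_dir", MODEL_DIR,
     "--output_dir", "ckpt_sq_per_channel_token",
     "--dtype", "float16",
     "--smoothquant", "0.5",
     "--per_channel", "--per_token",
     "--tp_size", "4"])

# Build an engine from each checkpoint, then run the MMLU harness (step 4).
# TP-4 engines are launched with one rank per GPU via mpirun.
for ckpt, engine in [("ckpt_sq_per_tensor", "engine_sq_per_tensor"),
                     ("ckpt_sq_per_channel_token", "engine_sq_per_channel_token")]:
    run(["trtllm-build", "--checkpoint_dir", ckpt,
         "--output_dir", engine, "--gemm_plugin", "float16"])
    run(["mpirun", "-n", "4", "python", "examples/mmlu.py",
         "--hf_model_dir", MODEL_DIR,
         "--engine_dir", engine,
         "--test_trt_llm"])
```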

Expected behavior

Similar MMLU accuracy between the Per-Tensor and Per-Channel + Per-Token engines

Actual behavior

MMLU accuracy for Llama-2 13B chat, SmoothQuant 0.5, TP-4:

| Quantization mode | Average | STEM | Humanities | Social Science | Misc |
|---|---|---|---|---|---|
| Per-Channel + Per-Token | 54.52 | 43.31 | 49.80 | 62.04 | 60.24 |
| Per-Tensor | 29.41 | 29.56 | 25.65 | 28.31 | 31.77 |

Additional notes

n/a

byshiue commented 1 month ago

Why do you expect the accuracy of Per-Tensor and Per-Channel + Per-Token to be close? It is expected that Per-Channel + Per-Token has higher accuracy.
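
For intuition, here is a minimal NumPy sketch (toy data, not TensorRT-LLM code) of why a single tensor-wide scale degrades accuracy when activations contain outlier tokens, while a per-token scale adapts to each row:

```python
# Toy per-tensor vs. per-token INT8 quantization comparison.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 256)).astype(np.float32)
x[3] *= 50.0  # one outlier token, as is common in LLM activations


def quant_dequant(x, scale):
    # Symmetric int8 quantization followed by dequantization.
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale


# Per-tensor: a single scale chosen by the global maximum.
per_tensor = quant_dequant(x, np.abs(x).max() / 127.0)

# Per-token: one scale per row.
per_token = quant_dequant(x, np.abs(x).max(axis=1, keepdims=True) / 127.0)

print("per-tensor MSE:", float(np.mean((x - per_tensor) ** 2)))
print("per-token  MSE:", float(np.mean((x - per_token) ** 2)))
```

With per-tensor scaling, the outlier row dictates the scale for every token, so the remaining rows lose most of their 8-bit resolution; per-token (and, on the weight side, per-channel) scaling avoids this, which is why its accuracy is expected to be higher.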

vnkc1 commented 1 month ago

Is a ~25-point drop in MMLU 5-shot accuracy (54.52 → 29.41) for Llama-2 13B expected?

byshiue commented 1 month ago

It is hard to say whether it is expected, because it depends on the quantization workflow and the model. But Per-Channel + Per-Token is the suggested mode and preserves accuracy well. Could you explain why you want to use Per-Tensor?

Hongbosherlock commented 1 month ago

> It is hard to say whether it is expected, because it depends on the quantization workflow and the model. But Per-Channel + Per-Token is the suggested mode and preserves accuracy well. Could you explain why you want to use Per-Tensor?

May I ask how the per-token scales are computed on the fly? Could you point out where the code is?

byshiue commented 1 month ago

Here is an example.

Hongbosherlock commented 4 days ago

> Here is an example.

As far as I know, per-token is generally used together with SmoothQuant. I noticed that the SmoothQuant plugin includes a per-token plugin. What is the relationship between the per-token plugin code here and the code you referred to? Thanks!

byshiue commented 2 days ago

The code you referred to quantizes the input tensor from higher precision to INT8 before it enters the SmoothQuant GEMM.
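
For readers following along, a conceptual sketch (plain PyTorch, not the actual plugin kernel; the function names are illustrative) of the arithmetic around a SmoothQuant W8A8 GEMM with per-token activation scales: the input is quantized to int8 on the fly with one dynamic scale per token, multiplied with the offline-quantized int8 weights, then dequantized with the per-token and per-channel scales.

```python
import torch


def per_token_quantize(x: torch.Tensor):
    # x: [num_tokens, hidden]; one dynamic scale per token (row),
    # computed on the fly from the row's absolute maximum.
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale


def w8a8_gemm(x, w_int8, w_scale):
    # w_int8: [hidden, out], quantized offline with one scale per output channel.
    q, x_scale = per_token_quantize(x)
    # Emulate the int8 GEMM's int32 accumulation in float for portability.
    acc = q.to(torch.float32) @ w_int8.to(torch.float32)
    return acc * x_scale * w_scale  # dequantize: per-token x per-channel


x = torch.randn(4, 64)
w = torch.randn(64, 32)
w_scale = w.abs().amax(dim=0, keepdim=True) / 127.0
w_int8 = torch.clamp(torch.round(w / w_scale), -127, 127).to(torch.int8)
print(w8a8_gemm(x, w_int8, w_scale).shape)  # torch.Size([4, 32])
```

In the per-tensor variant, x_scale would instead be a single scalar calibrated offline, which removes the dynamic row-wise reduction but is exactly what costs accuracy in the MMLU numbers above.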