Open vnkc1 opened 2 months ago
Why do you expect the accuracy of Per-Tensor and Per-Channel + Per-Token to be close? Per-Channel + Per-Token is expected to give higher accuracy.
Is a 24% drop in MMLU 5-shot accuracy for Llama-2 13B expected?
It is hard to say whether this is expected, because it depends on the quantization workflow and the model. However, Per-Channel + Per-Token is the recommended mode and preserves accuracy well. Could you explain why you want to use Per-Tensor?
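To see why Per-Channel scaling tends to preserve accuracy better than Per-Tensor, here is a minimal numpy sketch (not TensorRT-LLM code) of int8 weight quantization on a hypothetical matrix with one outlier channel, a common situation in LLM weights. With a single per-tensor scale, the outlier inflates the scale for every channel, so the small-magnitude channels are rounded coarsely; per-channel scales avoid that.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weight matrix; column 3 is an outlier channel.
w = rng.standard_normal((64, 8)).astype(np.float32)
w[:, 3] *= 50.0

def fake_quantize(w, scale):
    # Quantize to int8 range, then dequantize so we can measure error.
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

# Per-tensor: one scale for the whole matrix, dominated by the outlier.
s_tensor = np.abs(w).max() / 127.0
err_tensor = np.abs(w - fake_quantize(w, s_tensor)).mean()

# Per-channel: one scale per channel (column here).
s_channel = np.abs(w).max(axis=0, keepdims=True) / 127.0
err_channel = np.abs(w - fake_quantize(w, s_channel)).mean()

print(err_tensor, err_channel)
```

Running this shows a much larger mean absolute error for the per-tensor variant, which is consistent with the large MMLU gap reported below.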
May I ask how the Per-Token scale is computed on the fly? Could you point out where the code is?
The code you refer to quantizes the input tensor from higher precision to int8 before it enters the SmoothQuant GEMM.
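The idea of on-the-fly Per-Token quantization can be sketched as follows. This is a hypothetical numpy illustration, not the actual TensorRT-LLM implementation (which performs this in fused CUDA kernels): for each token (row) of the activation tensor, the scale is derived from that row's absolute maximum at runtime, then the row is rounded to int8.

```python
import numpy as np

def per_token_quantize(x: np.ndarray):
    """Dynamic per-token int8 quantization sketch.

    One scale per token (row), computed on the fly from the row's
    absolute maximum over the hidden dimension.
    """
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

# Usage: quantize a batch of 2 tokens with hidden size 16.
x = np.random.default_rng(0).standard_normal((2, 16)).astype(np.float32)
q, s = per_token_quantize(x)
# q (int8) feeds the SmoothQuant GEMM; s is applied when dequantizing.
```

Because the scales depend on the runtime activations, they cannot be baked into the engine the way per-tensor (static) scales can, which is part of why the two modes behave differently.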
System Info
GPUs: 4x A100 (40 GB memory)
Release: tensorrt-llm 0.9.0
Who can help?
@Tracin
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
Similar performance on MMLU between Per-Tensor and Per-Channel + Per-Token
Actual behavior
Llama-2 13B chat, SmoothQuant 0.5, TP-4:

| Mode | Average | STEM | Humanities | Social Science | Misc |
|---|---|---|---|---|---|
| Per-Channel + Per-Token | 54.52 | 43.31 | 49.8 | 62.04 | 60.24 |
| Per-Tensor | 29.41 | 29.56 | 25.65 | 28.31 | 31.77 |
additional notes
n/a