NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

INT4 AWQ quantization fails for Llama 2 7B & 13B with higher tensor parallel degrees #1636

Closed · ethnzhng closed 5 months ago

ethnzhng commented 5 months ago

Who can help?

@Tracin

Reproduction

Run quantize.py with --qformat int4_awq and --tp_size 4 or 8 for Llama 2 7B, or with --tp_size 8 for Llama 2 13B

e.g.

python ../quantization/quantize.py --model_dir /llama-2-7b-hf \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --output_dir ./quantized_int4-awq \
                                   --tp_size 4

Expected behavior

Quantization is successful

Actual behavior

Llama 2 7B

tp=4
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
    for model_config in torch_to_model_config(
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 240, in torch_to_model_config
    pack_linear_weights(model_config)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_utils.py", line 283, in pack_linear_weights
    linear_layer.weight = to_quantized_weight(
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_utils.py", line 233, in to_quantized_weight
    (weight / weights_scaling_factor[:, torch.arange(in_dim) // block_size])
IndexError: index 22 is out of bounds for dimension 0 with size 22
tp=8
...
IndexError: index 11 is out of bounds for dimension 0 with size 11

Llama 2 13B

tp=8
...
IndexError: index 14 is out of bounds for dimension 0 with size 14

Additional notes

Llama 3 8B can be quantized without error under the same conditions (int4_awq and tp_size 8).
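
For reference, here is a quick divisibility check (my own sketch, not TensorRT-LLM code) under the assumption that the failure comes from AWQ's block-wise scaling: each tensor-parallel shard of a row-parallel linear layer (attention output projection, MLP down projection) would need an input dimension that is a multiple of awq_block_size. The hidden/intermediate sizes below are the standard Hugging Face config values for these models.

# Hypothetical sketch: check whether each TP shard's input dimension
# divides evenly into AWQ blocks. Not part of TensorRT-LLM.
SHAPES = {
    # model: (hidden_size, intermediate_size) from the HF configs
    "llama-2-7b": (4096, 11008),
    "llama-2-13b": (5120, 13824),
    "llama-3-8b": (4096, 14336),
}

def awq_compatible(model: str, tp_size: int, block_size: int = 128) -> bool:
    """True if every sharded input dim is a multiple of block_size."""
    hidden, intermediate = SHAPES[model]
    # Row-parallel linears split their input dimension across TP ranks,
    # so each rank sees dim / tp_size input channels.
    return all((dim // tp_size) % block_size == 0 for dim in (hidden, intermediate))

for model, tp in [("llama-2-7b", 4), ("llama-2-7b", 8),
                  ("llama-2-13b", 8), ("llama-3-8b", 8)]:
    status = "ok" if awq_compatible(model, tp) else "fails"
    print(f"{model} tp={tp}: {status}")

# llama-2-7b tp=4: fails   (11008 / 4 = 2752, not a multiple of 128)
# llama-2-7b tp=8: fails   (11008 / 8 = 1376, not a multiple of 128)
# llama-2-13b tp=8: fails  (13824 / 8 = 1728, not a multiple of 128)
# llama-3-8b tp=8: ok      (14336 / 8 = 1792 = 14 * 128)

This matches the observed pattern: only Llama 3 8B's dimensions stay block-aligned after sharding.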

byshiue commented 5 months ago

Llama 2 7B with TP size 4 does not satisfy the constraint of INT4 AWQ when awq_block_size is 128. You can set --awq_block_size 64 when quantizing the checkpoint, as shown below. The other failing cases have the same cause. We might not be able to run 7B with TP 8 at all due to this limitation.
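
A minimal sketch of the suggested workaround, reusing the placeholder paths from the reproduction command above:

python ../quantization/quantize.py --model_dir /llama-2-7b-hf \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 64 \
                                   --output_dir ./quantized_int4-awq \
                                   --tp_size 4

With tp_size 4, each shard of the 7B MLP down projection has 11008 / 4 = 2752 input channels, a multiple of 64 but not of 128. With tp_size 8 the shard has 11008 / 8 = 1376 channels, which is not a multiple of 64 either (1376 / 64 = 21.5), which would explain why 7B with TP 8 may not be feasible at any supported block size.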