NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

AWQ quantization with Gemma 2 9B #2327

Open Alireza3242 opened 1 week ago

Alireza3242 commented 1 week ago

System Info

A100

Who can help?

@Tracin

Reproduction

I tried to quantize a Gemma 2 9B model with AWQ.

python3 examples/quantization/quantize.py --model_dir ./data/merged_model \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --output_dir ./data/tllm_checkpoint \
                                   --calib_size 32 \
                                   --calib_dataset ./src/quantization/dataset

Expected behavior

Quantization completes without error.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.13.0
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/tensor_quant.py:92: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  scaled_e4m3_abstract = torch.library.impl_abstract("trt::quantize_fp8")(
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:06<00:00,  1.29it/s]
Inserted 885 quantizers
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_quant.py:131: DeprecationWarning: forward_loop should take model as argument, but got forward_loop without any arguments. This usage will be deprecated in future versions.
  return calibrate(model, config["algorithm"], forward_loop=forward_loop)
Caching activation statistics for awq_lite...
Searching awq_lite parameters...
Loading extension modelopt_cuda_ext...
Cannot export model to the model_config. The modelopt-optimized model state_dict (including the quantization factors) is saved to data/tllm_checkpoint/modelopt_model.0.pth using torch.save for further inspection.
Detailed export error: 'parallel_attn_mlp_res'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 435, in export_tensorrt_llm_checkpoint
    for tensorrt_llm_config, weights in torch_to_tensorrt_llm_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 259, in torch_to_tensorrt_llm_checkpoint
    layer_config = build_decoder_config(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 1296, in build_decoder_config
    assert model_metadata_config[
KeyError: 'parallel_attn_mlp_res'
/usr/lib/python3.10/tempfile.py:1008: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpxshzhyw6'>
  _warnings.warn(warn_message, ResourceWarning)

additional notes

No additional notes.

Alireza3242 commented 6 days ago

I solved this problem with some changes for Gemma 2 9B, but we still have a problem with Gemma 2 27B:

In /usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py, add "Gemma2": "gemma2" to MODEL_NAME_PATTERN_MAP before the "Gemma": "gemma" entry.
In /usr/local/lib/python3.10/dist-packages/modelopt/torch/export/tensorrt_llm_utils.py, in MODEL_NAME_TO_HF_ARCH_MAP, change "gemma2": "GemmaForCausalLM" to "gemma2": "Gemma2ForCausalLM".
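
In code form, the two edits look roughly like this (a sketch: only the Gemma entries are shown, and the remaining entries are placeholders for whatever the installed files actually contain):

# tensorrt_llm/quantization/quantize_by_modelopt.py
MODEL_NAME_PATTERN_MAP = {
    "Gemma2": "gemma2",  # new entry, placed before "Gemma" so it matches first
    "Gemma": "gemma",
    # ... other entries unchanged ...
}

# modelopt/torch/export/tensorrt_llm_utils.py
MODEL_NAME_TO_HF_ARCH_MAP = {
    "gemma2": "Gemma2ForCausalLM",  # was "GemmaForCausalLM"
    # ... other entries unchanged ...
}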

But with Gemma 2 27B, when we quantize with AWQ, I have another problem: at inference time, result.output_token_ids always equals [[-1]].

imilli commented 3 days ago

@Alireza3242 You mentioned you solved this problem with some changes for Gemma 2 9B. What changes did you make?

imilli commented 3 days ago

@Superjomn When will this problem be fixed? I am also stuck here.

Alireza3242 commented 3 hours ago

@imilli
1. In /usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py, put "Gemma2": "gemma2" at the top of MODEL_NAME_PATTERN_MAP.
2. In /usr/local/lib/python3.10/dist-packages/modelopt/torch/export/tensorrt_llm_utils.py, set "gemma2": "Gemma2ForCausalLM" in MODEL_NAME_TO_HF_ARCH_MAP.
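
The ordering matters because the script appears to pick the first key in MODEL_NAME_PATTERN_MAP that occurs as a substring of the model's architecture name, so "Gemma" would shadow "Gemma2" if it came first. A hypothetical sketch of that lookup (infer_model_type is illustrative, not the library's actual function):

def infer_model_type(architecture, pattern_map):
    # Return the mapped type for the first key found in the architecture name.
    for pattern, model_type in pattern_map.items():
        if pattern.lower() in architecture.lower():
            return model_type
    return None

# With "Gemma" listed first, a Gemma 2 checkpoint is misclassified:
print(infer_model_type("Gemma2ForCausalLM", {"Gemma": "gemma", "Gemma2": "gemma2"}))  # -> gemma
# With "Gemma2" first, it resolves correctly:
print(infer_model_type("Gemma2ForCausalLM", {"Gemma2": "gemma2", "Gemma": "gemma"}))  # -> gemma2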