NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

AWQ Int4 Quantization with --pp_size=2 fails #1476

Open schmidek opened 4 months ago

schmidek commented 4 months ago

System Info

NVIDIA A100 80GB x 4

Who can help?

@Tracin

Reproduction

python ../quantization/quantize.py --model_dir /data/models/Mixtral-8x22B-Instruct-v0.1 \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --output_dir ./quantized_int4-awq \
                                   --batch_size 8 \
                                   --pp_size=2

Expected behavior

Model is quantized and exported successfully

Actual behavior

Calibration and quantization work, but exporting fails with:

AssertionError: Inference time pipeline parallel is only supported with export_tensorrt_llm_config on and the build API from the TensorRT LLM repo
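
A hedged reading of the failure, based only on the traceback in the logs below: export_model_config in ammo refuses a pipeline-parallel export (--pp_size=2) unless its export_tensorrt_llm_config flag is enabled. The following is a minimal sketch that only mirrors the check implied by the assertion; the function and parameter names are taken from the traceback, and everything else is an assumption, not the actual ammo source.

# Sketch of the check implied by the AssertionError; names come from the traceback,
# the rest is assumed for illustration and is not the real ammo implementation.
def _check_pp_export(pp_size: int, export_tensorrt_llm_config: bool) -> None:
    if pp_size > 1:
        assert export_tensorrt_llm_config, (
            "Inference time pipeline parallel is only supported with "
            "export_tensorrt_llm_config on and the build API from the TensorRT LLM repo"
        )

# With the arguments from the reproduction (--pp_size=2) and the flag off,
# the same assertion fires:
try:
    _check_pp_export(pp_size=2, export_tensorrt_llm_config=False)
except AssertionError as err:
    print(f"AssertionError: {err}")

If the flag can simply be forwarded, patching the export_model_config call in tensorrt_llm/quantization/quantize_by_ammo.py (line 334 in the traceback) to pass export_tensorrt_llm_config=True might unblock the export, but that is an assumption rather than a confirmed fix.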

Additional notes

Logs:

[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Initializing model from /data/models/Mixtral-8x22B-Instruct-v0.1
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 59/59 [32:22<00:00, 32.92s/it]
[TensorRT-LLM][WARNING] The manually set model data type is torch.float16, but the data type of the HuggingFace model is torch.bfloat16.
Initializing tokenizer from /data/models/Mixtral-8x22B-Instruct-v0.1
AWQ calibration could take longer with calib_size = 512, Using calib_size=32 instead

AWQ calibration could take longer than other calibration methods. Please increase the batch size to speed up the calibration process. Batch size can be set by adding the argument --batch_size <batch_size> to the command line.

Loading calibration dataset
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 15.6k/15.6k [00:00<00:00, 58.1MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 257M/257M [00:01<00:00, 211MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 257M/257M [00:01<00:00, 237MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 259M/259M [00:01<00:00, 234MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 34.7M/34.7M [00:00<00:00, 44.5MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 30.0M/30.0M [00:00<00:00, 76.9MB/s]
Generating train split: 100%|███████████████████████████████████████████████████████████████████████████████| 287113/287113 [00:03<00:00, 84750.30 examples/s]
Generating validation split: 100%|████████████████████████████████████████████████████████████████████████████| 13368/13368 [00:00<00:00, 89275.75 examples/s]
Generating test split: 100%|██████████████████████████████████████████████████████████████████████████████████| 11490/11490 [00:00<00:00, 85651.44 examples/s]
{'quant_cfg': {'*weight_quantizer': {'num_bits': 4, 'block_sizes': {-1: 128}, 'enable': True}, '*input_quantizer': {'enable': False}, '*lm_head*': {'enable': False}, '*output_layer*': {'enable': False}, 'default': {'enable': False}}, 'algorithm': {'method': 'awq_lite', 'alpha_step': 0.1}}
Starting quantization...
Replaced 4875 modules to quantized modules
Caching activation statistics for awq_lite...
Calibrating batch 0
Calibrating batch 1
Calibrating batch 2
Calibrating batch 3
Searching awq_lite parameters...
Calibrating batch 0
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Loading extension ammo_cuda_ext...
Loading extension ammo_cuda_ext_fp8...
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  value = torch.tensor(value, device=self._pre_quant_scale.device)
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Calibrating batch 1
Calibrating batch 2
Calibrating batch 3
Quantization done. Total time used: 384.39 s.
Unknown model type MixtralForCausalLM. Continue exporting...
Traceback (most recent call last):
  File "/data/src/TensorRT-LLM/examples/llama/../quantization/quantize.py", line 52, in <module>
    quantize_and_export(model_dir=args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 334, in quantize_and_export
    export_model_config(model,
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 295, in export_model_config
    assert export_tensorrt_llm_config, (
AssertionError: Inference time pipeline parallel is only supported with export_tensorrt_llm_config on and the build API from the TensorRT LLM repo
Tracin commented 4 months ago

@schmidek Hi, quantization of Mixtral is not supported in the current version of AMMO.