[TensorRT-LLM][WARNING] The manually set model data type is torch.float16, but the data type of the HuggingFace model is torch.float32.
Initializing tokenizer from model
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
AWQ calibration could take longer than other calibration methods. Please increase the batch size to speed up the calibration process. Batch size can be set by adding the argument --batch_size to the command line.
Loading calibration dataset
[NeMo W 2024-07-31 06:37:10 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by promote_options='default'.
table = cls._concat_blocks(blocks, axis=0)
[NeMo W 2024-07-31 06:38:15 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
value = torch.tensor(value, device=self._pre_quant_scale.device)
Traceback (most recent call last):
  File "/app/TensorRT-LLM/examples/quantization/quantize.py", line 364, in <module>
    main(args)
  File "/app/TensorRT-LLM/examples/quantization/quantize.py", line 284, in main
    model = quantize_model(model, quant_cfg, calib_dataloader)
  File "/app/TensorRT-LLM/examples/quantization/quantize.py", line 221, in quantize_model
    atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/model_quant.py", line 112, in quantize
    calibrate(model, config["algorithm"], forward_loop=forward_loop)
  File "ammo/torch/quantization/model_calib.py", line 59, in ammo.torch.quantization.model_calib.calibrate
  File "ammo/torch/quantization/model_calib.py", line 185, in ammo.torch.quantization.model_calib.awq
  File "ammo/torch/quantization/model_calib.py", line 187, in ammo.torch.quantization.model_calib.awq
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "ammo/torch/quantization/model_calib.py", line 330, in ammo.torch.quantization.model_calib.awq_lite
  File "/app/TensorRT-LLM/examples/quantization/quantize.py", line 217, in calibrate_loop
    model(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1068, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 796, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 386, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "ammo/torch/quantization/model_calib.py", line 294, in ammo.torch.quantization.model_calib.awq_lite.forward
NotImplementedError: Cannot copy out of meta tensor; no data!
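
The accelerate/hooks.py frames in the traceback suggest the model was loaded with a device_map, and the meta-tensor error means some weights were never materialized on a real device when AWQ calibration ran. That would be consistent with the float32 warning at the top of the log: an 8B model in float32 needs roughly 32 GB for the weights alone, more than the A10's 24 GB, so accelerate would offload part of the model. A minimal diagnostic sketch to check this, assuming the checkpoint is loaded via transformers with device_map="auto" (the path and dtype mirror the reproduction command below; this is an assumption, not code taken from quantize.py):

```python
# Diagnostic sketch (assumption: the checkpoint is loaded with
# device_map="auto"; path and dtype mirror the repro command below).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model",                    # --model_dir
    torch_dtype=torch.float16,  # --dtype float16
    device_map="auto",          # accelerate may offload layers that don't fit
)

# Any parameters left on the "meta" device have no data and would trigger
# "Cannot copy out of meta tensor" during calibration.
meta = [n for n, p in model.named_parameters() if p.device.type == "meta"]
print(f"{len(meta)} parameters on the meta device")
print(meta[:10])
```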
System Info
Getting this error while trying to quantize the Llama 3 8B model with tensorrt_llm 0.9.0.
GPU: A10 (24 GB)
Docker image: 23.10-trtllm-python-py3
Ref: https://github.com/NVIDIA/TensorRT-LLM/issues/1182
Who can help?
@byshiue @Tracin
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
python3 TensorRT-LLM/examples/quantization/quantize.py --model_dir model \
    --output_dir tllm_checkpoint_1gpu_awq \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128
Expected behavior
A quantized checkpoint written to tllm_checkpoint_1gpu_awq.
Actual behavior
The script fails during AWQ calibration with NotImplementedError: Cannot copy out of meta tensor; no data! (full traceback above).
Additional notes
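A possible workaround, assuming the root cause really is offloading of the oversized fp32 checkpoint: re-save it as float16 first (roughly 16 GB, which fits on the A10) and point --model_dir at the new directory. The model_fp16 path below is just an illustrative name:

```python
# Workaround sketch (assumption: the failure is caused by accelerate
# offloading the oversized fp32 checkpoint). Re-save the checkpoint as fp16
# so the whole model can sit on the GPU during calibration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("model", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("model")

model.save_pretrained("model_fp16")      # then rerun with --model_dir model_fp16
tokenizer.save_pretrained("model_fp16")
```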
Python packages