NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Using Llama Quantize example produces AttributeError: 'NoneType' object has no attribute 'fake_tensor_quant_with_axis' #255

Closed lynkz-matt-psaltis closed 1 year ago

lynkz-matt-psaltis commented 1 year ago

When attempting to quantize a Phind CodeLlama model, I receive an exception: AttributeError: 'NoneType' object has no attribute 'fake_tensor_quant_with_axis'

Using AMMO 3.0; TensorRT-LLM was compiled from the main branch.

python /app/tensorrt_llm/examples/llama/quantize.py --model_dir /mnt/models/Phind/Phind-CodeLlama-34B-v2 \
>                 --dtype float16 \
>                 --qformat int4_awq \
>                 --export_path /mnt/models/Phind/Phind-CodeLlama-34B-v2.pt \
>                 --calib_size 256
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[11/02/2023-14:23:10] The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|7/7 [07:20<00:00, 62.88s/it]
Loading calibration dataset
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2436: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
Replaced 1011 modules to quantized modules
Caching activation statistics for awq_lite...
Searching awq_lite parameters...
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:152: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/quantize.py", line 146, in <module>
    main()
  File "/app/tensorrt_llm/examples/llama/quantize.py", line 139, in main
    model = quantize_and_export(model,
  File "/app/tensorrt_llm/tensorrt_llm/models/quantized/ammo.py", line 79, in quantize_and_export
    model = _quantize_model(model,
  File "/app/tensorrt_llm/tensorrt_llm/models/quantized/ammo.py", line 55, in _quantize_model
    atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/model_quant.py", line 114, in quantize
    calibrate(model, config["algorithm"], forward_loop=forward_loop)
  File "ammo/torch/quantization/model_calib.py", line 60, in ammo.torch.quantization.model_calib.calibrate
  File "ammo/torch/quantization/model_calib.py", line 182, in ammo.torch.quantization.model_calib.awq
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "ammo/torch/quantization/model_calib.py", line 306, in ammo.torch.quantization.model_calib.awq_lite
  File "/app/tensorrt_llm/tensorrt_llm/models/quantized/ammo.py", line 52, in calibrate_loop
    model(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 708, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "ammo/torch/quantization/model_calib.py", line 272, in ammo.torch.quantization.model_calib.awq_lite.forward
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/quant_module.py", line 57, in forward
    self.__dict__["weight"] = self.weight_quantizer(self.weight)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py", line 501, in forward
    outputs = self._quant_forward(inputs)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py", line 343, in _quant_forward
    outputs = fake_tensor_quant(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/tensor_quant.py", line 357, in forward
    outputs = cuda_ext.fake_tensor_quant_with_axis(
AttributeError: 'NoneType' object has no attribute 'fake_tensor_quant_with_axis'
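
The bottom frame of the trace shows the problem: `cuda_ext` inside `ammo/torch/quantization/tensor_quant.py` is `None` when `fake_tensor_quant_with_axis` is called, i.e. AMMO's pre-compiled CUDA extension never loaded. A minimal diagnostic sketch, assuming only the module layout visible in the traceback (the `cuda_ext` attribute name is taken from the last frame; treat it as an assumption, not a documented API):

```python
# Diagnostic sketch: check whether AMMO's pre-compiled CUDA extension loaded.
import torch
from ammo.torch.quantization import tensor_quant

# Versions the pre-built extension has to match on this host.
print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)

# `cuda_ext` is the module-level handle used at tensor_quant.py:357 in the
# traceback above; if it is None, any quantized forward pass will fail with
# exactly this AttributeError.
print("ammo cuda_ext loaded:", getattr(tensor_quant, "cuda_ext", None) is not None)
```
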
jdemouth-nvidia commented 1 year ago

Hi @lynkz-matt-psaltis ,

Thanks for reporting that issue. I'm going to forward the issue to the AMMO team. I'll let you know when they have some feedback about it.

Thanks, Julien

RalphMao commented 1 year ago

Hi @lynkz-matt-psaltis, this issue happens when the pre-compiled CUDA extension is not compatible with the host CUDA/torch versions.

In your case, we suggest you find the source wheel files in the AMMO tarball (those without cuxxx in the wheel name), which compile the extension on the fly. Also see: https://github.com/NVIDIA/TensorRT-LLM/issues/126
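
Since the source wheel builds its CUDA extension at install time, it needs a local CUDA toolkit compatible with the installed torch build. A short pre-flight sketch using only generic PyTorch utilities (nothing here is AMMO-specific):

```python
# Sketch: sanity checks before installing the source (non-cuXXX) wheel,
# which compiles its CUDA extension on the fly against the local toolkit.
import torch
from torch.utils.cpp_extension import CUDA_HOME

print("torch:", torch.__version__)
print("torch CUDA:", torch.version.cuda)  # CUDA version torch was compiled against
print("CUDA_HOME:", CUDA_HOME)            # toolkit an on-the-fly build would use

# An on-the-fly build cannot work without a toolkit on the machine.
assert CUDA_HOME is not None, "No CUDA toolkit found; install one before the source wheel"
```
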

lynkz-matt-psaltis commented 1 year ago

Awesome, thanks so much for that, team! @RalphMao & @jdemouth-nvidia