NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

TensorRT Quantization Breaks for `LlamaLinearScalingRotaryEmbedding` #1083

Open Sanger2000 opened 9 months ago

Sanger2000 commented 9 months ago

System Info

NVIDIA 4090, TensorRT-LLM 0.7.1

In nvidia-ammo, these lines in ammo/torch/export/layer_utils.py appear to fail unexpectedly for some Llama variants:

[screenshot: the is_linear / build_linear_config path in ammo/torch/export/layer_utils.py]

In particular, the DeepSeek models use LlamaLinearScalingRotaryEmbedding. Because that class name contains "Linear", the module is picked up by the is_linear check and treated as the dense case. However, this module has no .weight, so build_linear_config fails.

There are plenty of easy fixes for this (for example, just checking whether "Rotary" is in the class name and skipping that case). I'm happy to contribute one, but I don't think there is an OSS repo to do so in.
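For context, here is a minimal sketch of the misfiring code path, reconstructed from the traceback below and the hotfix later in this thread (paraphrased, not the verbatim ammo source):

import torch.nn as nn

def is_linear(module: nn.Module) -> bool:
    """Returns whether the module is a linear layer."""
    # Pure class-name matching: anything whose class name contains "Linear"
    # passes, including LlamaLinearScalingRotaryEmbedding.
    return any(k in type(module).__name__ for k in ["Linear", "Conv1D", "NormHead"])

# build_linear_config then assumes a real linear layer and does roughly:
#     torch_weight = module.weight.detach()
# which raises AttributeError, because the rotary embedding has no .weight.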

Who can help?

@Tracin

Reproduction

Try quantizing deepseek-coder-6.7b-base to fp8, then building and running it.
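For example, with the quantization example script (a sketch; flags mirror the int4_awq command quoted later in this thread, with fp8 substituted, and the paths are illustrative):

python examples/quantization/quantize.py --model_dir ./deepseek-coder-6.7b-base \
                --dtype bfloat16 \
                --qformat fp8 \
                --output_dir ./deepseek-coder-6.7b-base-fp8 \
                --calib_size 32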

Expected behavior

I expect the model to generate tokens.

Actual behavior

The code throws: AttributeError: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'

Additional notes

N/A

shatealaboxiaowang commented 8 months ago

Is there a solution? I have the same problem.

activezhao commented 8 months ago

@Sanger2000 I have the same problem with the deepseek-coder-6.7b-base model. Have you solved it?

python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-coder-6.7b-base/ \
                --dtype bfloat16 \
                --qformat int4_awq \
                --batch_size 8 \
                --tp_size 2 \
                --awq_block_size 128 \
                --output_dir /data/deepseek-coder-6.7b-base-int4-awq-tp2 \
                --calib_size 32

................................................................

/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Loading extension ammo_cuda_ext...
Loading extension ammo_cuda_ext_fp8...
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  value = torch.tensor(value, device=self._pre_quant_scale.device)
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Calibrating batch 1
Calibrating batch 2
Calibrating batch 3
Quantization done. Total time used: 65.55 s.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to /data/deepseek-coder-6.7b-base-int4-awq-tp2/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
    for model_config in torch_to_model_config(
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 185, in torch_to_model_config
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 945, in build_decoder_config
    config.attention = build_attention_config(layer, model_metadata_config, dtype, config)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 638, in build_attention_config
    config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 581, in build_linear_config
    torch_weight = module.weight.detach()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'
Quantized model exported to /data/deepseek-coder-6.7b-base-int4-awq-tp2 
Total time used 10.00 s.

RalphMao commented 8 months ago

Thank you for pointing out this issue. We will add a fix to more robustly distinguish the actual dense linear layer.
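One way to do that is to require an actual 2-D weight parameter rather than relying on the class name alone. A minimal sketch of that idea (an illustration, not the shipped fix):

import torch.nn as nn

def is_linear(module: nn.Module) -> bool:
    """Treat a module as linear only if it actually carries a 2-D weight."""
    if not any(k in type(module).__name__ for k in ["Linear", "Conv1D", "NormHead"]):
        return False
    weight = getattr(module, "weight", None)
    return weight is not None and weight.dim() == 2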

silverriver commented 8 months ago

I am facing the same issue with v0.8.0. Help needed.

activezhao commented 8 months ago

Thank you for pointing out this issue. We will add a fix to more robustly distinguish the actual dense linear layer.

Hi @RalphMao Are there any temporary ways to avoid this problem now?

Opdoop commented 7 months ago

@activezhao A hotfix would be to modify the is_linear function to skip 'Rotary' layers:

def is_linear(module: nn.Module) -> bool:
    """Returns whether the module is a linear layer."""
    # Exclude rotary-embedding classes whose names contain "Linear",
    # e.g. LlamaLinearScalingRotaryEmbedding.
    return any(k in type(module).__name__ for k in ["Linear", "Conv1D", "NormHead"]) and "Rotary" not in type(module).__name__
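If editing the installed package is inconvenient, the same skip can be applied at runtime from a wrapper script that drives the export, assuming callers resolve is_linear through layer_utils' module globals (a sketch, not an official workaround):

# Hypothetical wrapper: patch ammo's is_linear before running the export.
import ammo.torch.export.layer_utils as layer_utils

_orig_is_linear = layer_utils.is_linear

def _patched_is_linear(module):
    # Skip rotary-embedding modules such as LlamaLinearScalingRotaryEmbedding.
    return _orig_is_linear(module) and "Rotary" not in type(module).__name__

layer_utils.is_linear = _patched_is_linear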

activezhao commented 7 months ago


@Opdoop OK, thanks.

activezhao commented 7 months ago


Hi @Opdoop I have a question:

If I set --qformat to fp8 in quantize.py, are the weights and activations both quantized to FP8?

Thanks

python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-coder-6.7b-base/ \
                --dtype bfloat16 \
                --qformat int4_awq \
                --batch_size 8 \
                --tp_size 2 \
                --awq_block_size 128 \
                --output_dir /data/deepseek-coder-6.7b-base-int4-awq-tp2 \
                --calib_size 32

hello-11 commented 2 weeks ago

@Sanger2000 Do you still have the problem? If not, we will close it soon.