Sanger2000 opened this issue 9 months ago
Is there a solution? I have the same problem.
@Sanger2000 I have the same problem with the deepseek-coder-6.7b-base model. Have you solved it?
python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-coder-6.7b-base/ \
--dtype bfloat16 \
--qformat int4_awq \
--batch_size 8 \
--tp_size 2 \
--awq_block_size 128 \
--output_dir /data/deepseek-coder-6.7b-base-int4-awq-tp2 \
--calib_size 32
................................................................
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.register_buffer("_pre_quant_scale", torch.tensor(value))
Loading extension ammo_cuda_ext...
Loading extension ammo_cuda_ext_fp8...
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
value = torch.tensor(value, device=self._pre_quant_scale.device)
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.register_buffer("_pre_quant_scale", torch.tensor(value))
Calibrating batch 1
Calibrating batch 2
Calibrating batch 3
Quantization done. Total time used: 65.55 s.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to /data/deepseek-coder-6.7b-base-int4-awq-tp2/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
for model_config in torch_to_model_config(
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 185, in torch_to_model_config
build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 945, in build_decoder_config
config.attention = build_attention_config(layer, model_metadata_config, dtype, config)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 638, in build_attention_config
config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 581, in build_linear_config
torch_weight = module.weight.detach()
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'
Quantized model exported to /data/deepseek-coder-6.7b-base-int4-awq-tp2
Total time used 10.00 s.
Thank you for pointing out this issue. We will add a fix to more robustly distinguish the actual dense linear layer.
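For illustration only (this is a sketch of the idea, not the fix that actually shipped), a more robust way to distinguish a real dense layer would be to test for an actual 2-D weight tensor instead of matching on the class name:

import torch
import torch.nn as nn

def is_linear_structural(module: nn.Module) -> bool:
    """Sketch: treat a module as linear only if it carries a 2-D weight
    tensor, so weight-less modules like rotary embeddings are excluded."""
    weight = getattr(module, "weight", None)
    return isinstance(weight, torch.Tensor) and weight.dim() == 2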
I am facing the same issue with v0.8.0. Help needed.
Hi @RalphMao Are there any temporary ways to avoid this problem now?
@activezhao A hotfix would be to modify the is_linear function to skip 'Rotary' layers:
def is_linear(module: nn.Module) -> bool:
    """Returns whether the module is a linear layer."""
    name = type(module).__name__
    return any(k in name for k in ["Linear", "Conv1D", "NormHead"]) and "Rotary" not in name
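If you can't edit the installed package, the same fix can be applied as a monkey-patch before the export runs. This is a sketch assuming the module path shown in the traceback above; note that any code which imported is_linear directly at import time would keep the old reference:

import torch.nn as nn
import ammo.torch.export.layer_utils as layer_utils

def patched_is_linear(module: nn.Module) -> bool:
    """Same check as the hotfix above, skipping rotary-embedding modules."""
    name = type(module).__name__
    return any(k in name for k in ["Linear", "Conv1D", "NormHead"]) and "Rotary" not in name

# Replace the module-level function so calls inside layer_utils pick it up.
layer_utils.is_linear = patched_is_linear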
@Opdoop OK, thanks.
Hi @Opdoop I have a question: if I set --qformat to fp8 in quantize.py, are the weights and the activations both quantized to fp8? Thanks.
@Sanger2000 Do you still have the problem? If not, we will close it soon.
System Info
NVIDIA 4090, TensorRT-LLM 0.7.1
In nvidia-ammo, it appears these lines in ammo/torch/export/layer_utils.py fail unexpectedly for some Llama variants. In particular, the deepseek models use LlamaLinearScalingRotaryEmbedding. That module is picked up by the is_linear check (its class name contains "Linear") and is treated as the dense case. However, the module has no .weight, so build_linear_config fails.
There are lots of easy fixes for this (for example, just checking if "Rotary" is in the class name and skipping that case). Happy to contribute, but I don't think there is an OSS repo to do so.
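To make the failure mode concrete, here is a minimal standalone repro of the misclassification, using a dummy stand-in class (no ammo import needed):

import torch.nn as nn

class LlamaLinearScalingRotaryEmbedding(nn.Module):
    """Dummy stand-in: like the real rotary embedding, it has no .weight."""
    pass

def is_linear(module: nn.Module) -> bool:
    # The pre-fix check: a substring match on the class name.
    return any(k in type(module).__name__ for k in ["Linear", "Conv1D", "NormHead"])

rotary = LlamaLinearScalingRotaryEmbedding()
print(is_linear(rotary))  # True -- "Linear" matches inside "LinearScaling"
try:
    rotary.weight.detach()  # what build_linear_config then attempts
except AttributeError as e:
    print(e)  # 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'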
Who can help?
@Tracin
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Try compiling and then running deepseek-coder-6.7b-base with fp8 quantization.
Expected behavior
I expect the model to generate tokens.
Actual behavior
The export fails with: AttributeError: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'.
Additional notes
N/A