lishicheng1996 opened 3 weeks ago
Hi @lishicheng1996, does this only happen in w4a8_awq mode for the Qwen model? Could you please try again without the padding operation?
BTW, I opened the link to the padding scripts you provided but found nothing there.
Thanks for your reply! I have updated the link above to point to the issue containing the padding scripts. The reason for the padding is that I want to run this model with 2-way TP on 4090s. The internal kernel requires the tensor size per shard to be N x 128, but the intermediate_size of Qwen2-72B is 29568 = 115.5 x 2 (TP) x 128, so I have to pad it to build the engine.
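For reference, a minimal sketch of the pad-size arithmetic this implies (the variable names and numbers below are illustrative, not taken from the padding script in the linked issue):

import math

intermediate_size = 29568   # Qwen2-72B
tp_size = 2                 # 2-way tensor parallelism on 4090s
alignment = 128             # per-shard dimension must be a multiple of 128

unit = tp_size * alignment                           # 256
padded = math.ceil(intermediate_size / unit) * unit  # 29696
pad_size = padded - intermediate_size                # 128
print(padded, pad_size)                              # 29696 128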
I think it's caused by padding the weights with zeros.
Padding with 0 doesn't affect the computation in theory. However, w4a8_awq quantization quantizes the model per group (group_size=128), so some groups end up entirely zero and the quantization values computed for them are abnormal. Note that these values may also be applied to the activations, which makes the activations abnormal, so the output of the network is unexpected.
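To see why, here is a simplified illustration of symmetric per-group int4 scaling (not the exact computation w4a8_awq/ModelOpt performs): an all-zero group yields a zero scale, and anything divided by it becomes NaN.

import torch

group_size = 128
w = torch.zeros(group_size)              # a padded, all-zero group
scale = w.abs().amax() / 7               # per-group symmetric int4 scale
q = torch.round(w / scale).clamp(-8, 7)  # 0 / 0 -> NaN
print(scale.item(), q[:4])               # 0.0 tensor([nan, nan, nan, nan])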
You can try to pad the weights like this:
torch.randn([pad_size, shape_list[1]], dtype=value.dtype) * 0.001
There may be better ways; this is just an example for you :)
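For illustration, one way that line could be dropped into a small padding helper (the function name, the torch.cat usage, and the shapes below are my own assumptions, not the actual script from #1833):

import torch

def pad_rows_with_noise(value: torch.Tensor, pad_size: int) -> torch.Tensor:
    # Append pad_size rows of tiny random values instead of zeros so that
    # no 128-element quantization group ends up entirely zero.
    shape_list = list(value.shape)
    pad = torch.randn([pad_size, shape_list[1]], dtype=value.dtype) * 0.001
    return torch.cat([value, pad], dim=0)

# Example with illustrative shapes: pad 29568 rows up to 29696.
w = torch.randn(29568, 8192, dtype=torch.float16)
print(pad_rows_with_noise(w, 128).shape)  # torch.Size([29696, 8192])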
Thanks for your help, I’ll try it ^_^
System Info
GPU: 4090
TensorRT: 10.3
tensorrt-llm: 0.13.0.dev2024081300
Who can help?
@Tracin Could you please have a look? Thank you very much.
Reproduction
Hi, I tried Qwen2-72B w4a8 quantization but got empty output. I did it with the following steps:
Padding
Following the script in #1833
Quantization Command:
python TensorRT-LLM/examples/quantization/quantize.py --model_dir Qwen2-72B-Instruct-padding/ --qformat w4a8_awq --output_dir w4a8_ckpt
Build Engine
trtllm-build --checkpoint_dir w4a8_ckpt --output_dir w4a8_engine --gemm_plugin auto
Test output
python TensorRT-LLM/examples/run.py --max_output_len=50 --tokenizer_dir ./Qwen2-72B-Instruct-padding/ --engine_dir=w4a8_engine
Expected behavior
Normally generates outputs, as the fp8 and int4_awq engines do.
Actual behavior
Empty outputs
Additional notes
None