NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Qwen2-72B w4a8 empty output #2392

Open · lishicheng1996 opened this issue 3 weeks ago

lishicheng1996 commented 3 weeks ago

System Info

GPU: 4090
TensorRT: 10.3
TensorRT-LLM: 0.13.0.dev2024081300

Who can help?

@Tracin Could you please have a look? Thank you very much.

Reproduction

Hi, I tried Qwen2-72B w4a8 quantization but got empty output. I did it with the following steps:

Padding

Following the padding script here: #1833

Quantization Command:

python TensorRT-LLM/examples/quantization/quantize.py --model_dir Qwen2-72B-Instruct-padding/ --qformat w4a8_awq --output_dir w4a8_ckpt

Build Engine

trtllm-build --checkpoint_dir w4a8_ckpt --output_dir w4a8_engine --gemm_plugin auto

Test output

python TensorRT-LLM/examples/run.py --max_output_len=50 --tokenizer_dir ./Qwen2-72B-Instruct-padding/ --engine_dir=w4a8_engine

Expected behavior

Generates output normally, as with fp8 or int4_awq quantization.

Actual behavior

Empty outputs

Additional notes

None

heyuhhh commented 3 weeks ago

Hi @lishicheng1996, does this only happen in w4a8_awq mode for the Qwen model? Could you please try again without the padding operation?

BTW, I opened the link to the padding script you provided but found nothing there.

lishicheng1996 commented 3 weeks ago

Thanks for your reply! I updated the link above to point to the issue that contains the padding script. The reason for padding is that I want to run this model with 2-way TP on 4090s. The internal kernel needs the per-rank tensor size to be N×128, but the intermediate_size of Qwen2-72B is 29568, which is 115.5 × 2 (TP) × 128. So I have to pad it to build the engine.
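
A quick sketch of that arithmetic (the padded size of 29696 is my own calculation from the constraint above, not a value taken from this thread):

```python
import math

# Check the alignment issue described above, assuming the per-rank
# intermediate dimension must be a multiple of 128 with TP=2.
intermediate_size = 29568   # Qwen2-72B
tp = 2
align = 128

print(intermediate_size / (tp * align))  # 115.5 -> not an integer, so padding is required

# Smallest padded size whose per-rank slice is a multiple of 128:
padded = math.ceil(intermediate_size / (tp * align)) * (tp * align)
print(padded)                            # 29696
```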

heyuhhh commented 2 weeks ago

I think it's caused by padding the weights with zeros.

In theory, padding with 0 doesn't affect the computation. However, w4a8_awq quantizes the model per group (group_size = 128), so some groups end up entirely zero and the quantization parameters computed for them are abnormal. Note that these values may also be applied to the activations, which makes the activations abnormal, so the network output is unexpected.
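
A toy illustration of the failure mode (this is not TensorRT-LLM's actual quantization code, just a minimal sketch of symmetric per-group scaling):

```python
import torch

# Toy per-group 4-bit quantization: scale = amax / 7 over groups of 128 weights.
group_size = 128
w = torch.zeros(2, group_size)
w[0] = torch.randn(group_size)   # a normal group
# w[1] stays all zeros, like a group that falls entirely inside the zero padding

scales = w.abs().amax(dim=-1, keepdim=True) / 7.0
print(scales.squeeze())          # second scale is 0.0 for the padded group
print((w / scales)[1, :3])       # 0/0 -> NaN, which then corrupts downstream values
```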

You can try to pad the weights like this: torch.randn([pad_size, shape_list[1]], dtype=value.dtype) * 0.001

There may be better ways, but this is just an example for you :)
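
A minimal, self-contained sketch of that suggestion; value, pad_size, and shape_list are hypothetical stand-ins matching the snippet above, and the shapes are illustrative only:

```python
import torch

# Dummy stand-ins for the variables in the snippet above; the real shapes and
# dtypes come from the padding script in the linked issue.
value = torch.randn(29568, 1024)      # an "intermediate"-sized weight (illustrative)
pad_size = 29696 - value.shape[0]     # rows needed to reach the padded size
shape_list = list(value.shape)

# Pad with tiny random values instead of exact zeros so that no 128-wide
# quantization group is identically zero.
pad_rows = torch.randn([pad_size, shape_list[1]], dtype=value.dtype) * 0.001
value = torch.cat([value, pad_rows], dim=0)
print(value.shape)                    # torch.Size([29696, 1024])
```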

lishicheng1996 commented 2 weeks ago

Thanks for your help, I’ll try it ^_^