NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Qwen1.5-32B Python benchmark: satisfyProfile. Set dimension [10240] for tensor input_ids does not satisfy any optimization profiles. #1827

Closed: cqli0905 closed this issue 2 months ago

cqli0905 commented 3 months ago

System Info

CPU Arch: x86_64
GPU: A100
TensorRT-LLM: 0.11.0.dev2024061800
CUDA version: 12.1
OS: Ubuntu 22.04
Docker image: nvidia/cuda:12.1.0-devel-ubuntu22.04

Who can help?

No response


Reproduction

Step 1: Download Qwen-32B model files

cd TensorRT-LLM/examples/qwen
git clone https://www.modelscope.cn/qwen/Qwen1.5-32B-Chat.git ./tmp/Qwen/32B

Step 2: Generate engine files

python3 convert_checkpoint.py --model_dir ./tmp/Qwen/32B/ --output_dir ./tllm_ckpt_tp4 --dtype float16 --tp_size 4 --weight_only_precision int8

trtllm-build --checkpoint_dir ./tllm_ckpt_tp4/ --output_dir ./tmp/qwen/32b/trt_engines/fp16/4-gpu/ --gemm_plugin float16 --max_batch_size 64 --max_input_len 512 --max_output_len 750
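Note that this build does not pass --max_num_tokens, so the optimization profile for the packed input falls back to the default. A minimal sketch of the resulting limits (my own illustration; the 8192 default is inferred from the profile range printed in the error log below, not from documentation):

# Limits fixed by the trtllm-build call above.
max_batch_size = 64
max_input_len = 512
max_num_tokens = 8192  # assumed default; matches [1]..[8192] in the log below

# With packed inputs, the context phase flattens all prompts into a single
# token dimension, so a run must keep batch_size * input_len <= max_num_tokens.
print(max_batch_size * max_input_len)  # 32768 tokens if the full batch is used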

Step 3: Modify the allowed_configs.py file to support the Qwen1.5-32B model

cd TensorRT-LLM/benchmarks/python

Add the following entry to the allowed_configs.py file:

"qwen1.5_32b_chat":
    ModelConfig(name="qwen1.5_32b_chat",
                family="qwen2",
                benchmark_type="gpt",
                build_config=BuildConfig(
                    num_layers=64,
                    num_heads=40,
                    hidden_size=5120,
                    vocab_size=152064,
                    hidden_act='silu',
                    n_positions=32768,
                    inter_size=27392,
                    max_batch_size=64,
                    max_input_len=512,
                    max_seq_len=712,
                    builder_opt=None,
                )),
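The hyperparameters above are copied by hand, so it is worth cross-checking them against the downloaded checkpoint before editing allowed_configs.py. A small helper for that (my own sketch, not part of the repro; it assumes the standard Hugging Face config.json keys for the Qwen2 architecture):

import json

# Read the config shipped with the checkpoint downloaded in Step 1.
with open("./tmp/Qwen/32B/config.json") as f:
    cfg = json.load(f)

# Print the fields that map onto the BuildConfig entry above.
print("num_layers :", cfg.get("num_hidden_layers"))        # 64 in the entry above
print("num_heads  :", cfg.get("num_attention_heads"))      # 40
print("hidden_size:", cfg.get("hidden_size"))              # 5120
print("vocab_size :", cfg.get("vocab_size"))               # 152064
print("inter_size :", cfg.get("intermediate_size"))        # 27392
print("n_positions:", cfg.get("max_position_embeddings"))  # 32768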

Step 4: Run the Python benchmark script

cd TensorRT-LLM/benchmarks/python

mpirun -n 4 --allow-run-as-root python3 benchmark.py -m qwen1.5_32b_chat --mode=plugin --batch_size="20;24" --input_output_len="512,240" --engine_dir=/home/TensorRT-LLM/examples/qwen/tmp/qwen/32b/trt_engines/fp16/4-gpu

Expected behavior

[BENCHMARK] model_name qwen1.5_14b_chat world_size 4 num_heads 40 num_kv_heads 8 num_layers 64 hidden_size 5120 vocab_size 152064 precision float16 batch_size 16 gpu_weights_percent 1.0 input_length 512 output_length 240 gpu_peak_mem(gb) 0.0 build_time(s) 0 tokens_per_sec 745.92 percentile95(ms) 5155.022 percentile99(ms) 5155.022 latency(ms) 5148.002 compute_cap sm80 quantization QuantMode.0 generation_time(ms) 4454.605 total_generated_tokens 3824.0 generation_tokens_per_second 858.438

Actual behavior

[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor input_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor input_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor input_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor input_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor position_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor position_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor position_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor position_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::enqueueV3: Error Code 3: API Usage Error (Parameter check failed, condition: inputDimensionSpecified && inputShapesSpecified. Not all shapes are specified. Following input tensors' dimensions are not specified: input_ids, position_ids.)
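The rejected dimension is simply the flattened context batch from Step 4 (my reading of the log, tying the numbers together):

# 20 prompts of 512 input tokens each are packed into one input_ids axis.
batch_size, input_len = 20, 512  # from --batch_size="20;24" and --input_output_len="512,240"
print(batch_size * input_len)    # 10240, which exceeds the profile maximum of 8192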

Additional notes

NA

nv-guomingz commented 3 months ago

Let me try to reproduce it first.

nv-guomingz commented 3 months ago

@cqli0905 I think the command below needs the --use_weight_only knob added. If we only specify --weight_only_precision, we can't generate the low-precision checkpoint. Change

python3 convert_checkpoint.py --model_dir ./tmp/Qwen/32B/ --output_dir ./tllm_ckpt_tp4 --dtype float16 --tp_size 4 --weight_only_precision int8

to

python3 convert_checkpoint.py --model_dir ./tmp/Qwen/32B/ --output_dir ./tllm_ckpt_tp4 --dtype float16 --tp_size 4 --weight_only_precision int8 --use_weight_only

cqli0905 commented 3 months ago

The problem still exists.

nv-guomingz commented 3 months ago

I can reproduce this issue, and we've filed a bug to track it.

cqli0905 commented 2 months ago

Setting a larger value for the max_num_tokens field when using the trtllm-build command to generate the engine can solve this problem. @nv-guomingz
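For reference, a rebuilt engine along these lines might use a command like the following (my sketch, reusing the Step 2 paths; 16384 is an illustrative value that covers the largest benchmarked context phase, 24 x 512 = 12288 tokens, and stays at the ceiling the warning in the next comment recommends):

trtllm-build --checkpoint_dir ./tllm_ckpt_tp4/ --output_dir ./tmp/qwen/32b/trt_engines/fp16/4-gpu/ --gemm_plugin float16 --max_batch_size 64 --max_input_len 512 --max_output_len 750 --max_num_tokens 16384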

cqli0905 commented 2 months ago

> Setting a larger value for the max_num_tokens field when using the trtllm-build command to generate the engine can solve this problem. @nv-guomingz

However, doing so results in the following warning from trtllm:

[07/05/2024-09:43:11] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 65536
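To size max_num_tokens without tripping that warning, it is enough to cover the largest context-phase batch actually benchmarked rather than the engine's theoretical maximum. A rule-of-thumb sketch (my own, not an official formula):

def needed_max_num_tokens(batch_size, input_len):
    # The packed context phase flattens batch_size * input_len tokens into
    # the input_ids dimension, so the profile must admit at least this many.
    return batch_size * input_len

print(needed_max_num_tokens(24, 512))  # 12288: covers both benchmarked batch sizes
print(needed_max_num_tokens(64, 512))  # 32768: the full max_batch_size would exceed 16384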