Closed cqli0905 closed 2 months ago
Let me try to reproduce it first.
@cqli0905 I think the command below needs the --use_weight_only flag. If only --weight_only_precision is specified, the low-precision checkpoints are not generated.
python3 convert_checkpoint.py --model_dir ./tmp/Qwen/32B/ --output_dir ./tllm_ckpt_tp4 --dtype float16 --tp_size 4 --weight_only_precision int8
change to
python3 convert_checkpoint.py --model_dir ./tmp/Qwen/32B/ --output_dir ./tllm_ckpt_tp4 --dtype float16 --tp_size 4 --weight_only_precision int8 --use_weight_only
The problem still exists.
I can reproduce this issue, and we've filed a bug to track it.
Setting a larger value for the max_num_tokens field when using the trtllm-build command to generate the engine can solve this problem. @nv-guomingz
However, doing so will result in the following warning from trtllm:
[07/05/2024-09:43:11] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 65536
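A value well below 65536 may already be enough for this benchmark. The sketch below sizes max_num_tokens from the largest benchmarked batch and checks it against the 16384 threshold quoted in the warning. It assumes the context phase packs batch_size × input_len tokens into one tensor, which is consistent with the 10240 (20 × 512) dimension in the reported error. The helper names are hypothetical, and the --max_num_tokens flag spelling follows the field name used in the comment above; neither has been verified against this TensorRT-LLM version.

```python
import shlex

PROFILE_MAX = 8192       # upper bound of profile 0 in the reported error log
WARN_THRESHOLD = 16384   # threshold quoted in the trtllm-build warning


def required_max_num_tokens(batch_sizes, input_len):
    """Largest packed context-phase token count (assumes batch_size * input_len)."""
    return max(batch_sizes) * input_len


def build_command(max_num_tokens):
    """Return the issue's trtllm-build command with max_num_tokens appended."""
    args = [
        "trtllm-build",
        "--checkpoint_dir", "./tllm_ckpt_tp4/",
        "--output_dir", "./tmp/qwen/32b/trt_engines/fp16/4-gpu/",
        "--gemm_plugin", "float16",
        "--max_batch_size", "64",
        "--max_input_len", "512",
        "--max_output_len", "750",
        "--max_num_tokens", str(max_num_tokens),  # assumed flag spelling
    ]
    return shlex.join(args)


# Batch sizes 20 and 24 with input length 512, as in the benchmark command.
needed = required_max_num_tokens([20, 24], 512)
print(needed, needed > PROFILE_MAX, needed <= WARN_THRESHOLD)
print(build_command(needed))
```

Under these assumptions the benchmark needs 12288 tokens, which exceeds the 8192 profile limit but stays under the 16384 warning threshold. Running batches up to the full --max_batch_size of 64 at input length 512 would need 32768 and would trigger the warning.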
System Info
CPU Arch: x86_64
GPU: A100
TensorRT-LLM: 0.11.0.dev2024061800
CUDA version: 12.1
OS: Ubuntu 22.04
Docker: nvidia/cuda:12.1.0-devel-ubuntu22.04
Who can help?
No response
Reproduction
Step 1: Download Qwen-32B model files
cd TensorRT-LLM/examples/qwen
git clone https://www.modelscope.cn/qwen/Qwen1.5-32B-Chat.git ./tmp/Qwen/32B
Step 2: Generate engine files
python3 convert_checkpoint.py --model_dir ./tmp/Qwen/32B/ --output_dir ./tllm_ckpt_tp4 --dtype float16 --tp_size 4 --weight_only_precision int8
trtllm-build --checkpoint_dir ./tllm_ckpt_tp4/ --output_dir ./tmp/qwen/32b/trt_engines/fp16/4-gpu/ --gemm_plugin float16 --max_batch_size 64 --max_input_len 512 --max_output_len 750
Step 3: Modify the allowed_configs.py file to support the qwen1.5-32b model.
cd TensorRT-LLM/benchmarks/python
Add the code for the new model to the allowed_configs.py file.
Step 4: Run the Python benchmark script
cd TensorRT-LLM/benchmarks/python
mpirun -n 4 --allow-run-as-root python3 benchmark.py -m qwen1.5_32b_chat --mode=plugin --batch_size="20;24" --input_output_len="512,240" --engine_dir=/home/TensorRT-LLM/examples/qwen/tmp/qwen/32b/trt_engines/fp16/4-gpu
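To see how many tokens each benchmarked batch submits at once, the flag values from the command above can be expanded with a short sketch. It assumes that ";" separates batch sizes, that "," separates input and output length, and that inputs are packed so the context phase carries batch_size × input_len tokens. The function names are hypothetical and do not come from benchmark.py.

```python
PROFILE_MAX = 8192  # upper bound of profile 0 reported in this issue


def parse_batch_sizes(flag):
    """Split a value such as "20;24" into a list of integers."""
    return [int(part) for part in flag.split(";")]


def parse_input_output_len(flag):
    """Split a value such as "512,240" into (input_len, output_len)."""
    input_len, output_len = (int(part) for part in flag.split(","))
    return input_len, output_len


def context_tokens(batch_sizes, input_len):
    """Map each batch size to its packed context-phase token count."""
    return {batch: batch * input_len for batch in batch_sizes}


input_len, _ = parse_input_output_len("512,240")
tokens = context_tokens(parse_batch_sizes("20;24"), input_len)
for batch, count in tokens.items():
    status = "exceeds" if count > PROFILE_MAX else "fits within"
    print(f"batch_size={batch}: {count} tokens {status} the {PROFILE_MAX} limit")
```

Both batch sizes exceed the limit under these assumptions, and batch size 20 yields the 10240 tokens named in the reported error.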
Expected behavior
[BENCHMARK] model_name qwen1.5_14b_chat world_size 4 num_heads 40 num_kv_heads 8 num_layers 64 hidden_size 5120 vocab_size 152064 precision float16 batch_size 16 gpu_weights_percent 1.0 input_length 512 output_length 240 gpu_peak_mem(gb) 0.0 build_time(s) 0 tokens_per_sec 745.92 percentile95(ms) 5155.022 percentile99(ms) 5155.022 latency(ms) 5148.002 compute_cap sm80 quantization QuantMode.0 generation_time(ms) 4454.605 total_generated_tokens 3824.0 generation_tokens_per_second 858.438
actual behavior
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor input_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor input_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor input_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor input_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor position_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor position_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor position_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [10240] for tensor position_ids does not satisfy any optimization profiles. Valid range for profile 0: [1]..[8192].)
[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::enqueueV3: Error Code 3: API Usage Error (Parameter check failed, condition: inputDimensionSpecified && inputShapesSpecified. Not all shapes are specified. Following input tensors' dimensions are not specified: input_ids, position_ids.)
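Because every rank prints the same message, it can help to reduce the log to the distinct tensor, requested dimension, and valid range. The following is a minimal sketch that does this with a regular expression written against the message text shown above. The pattern and function name are illustrative only and may need adjusting for other TensorRT versions.

```python
import re

# Matches the satisfyProfile message format seen in this issue's log.
PROFILE_ERROR = re.compile(
    r"Set dimension \[(?P<dim>\d+)\] for tensor (?P<tensor>\w+) does not satisfy "
    r"any optimization profiles\. Valid range for profile (?P<profile>\d+): "
    r"\[(?P<low>\d+)\]\.\.\[(?P<high>\d+)\]"
)


def summarize_profile_errors(log_text):
    """Return sorted, de-duplicated (tensor, dim, low, high) tuples."""
    found = set()
    for match in PROFILE_ERROR.finditer(log_text):
        found.add((
            match["tensor"],
            int(match["dim"]),
            int(match["low"]),
            int(match["high"]),
        ))
    return sorted(found)


sample = (
    "[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: "
    "API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension "
    "[10240] for tensor input_ids does not satisfy any optimization profiles. "
    "Valid range for profile 0: [1]..[8192].)\n"
    "[06/24/2024-06:51:13] [TRT] [E] IExecutionContext::setInputShape: Error Code 3: "
    "API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension "
    "[10240] for tensor position_ids does not satisfy any optimization profiles. "
    "Valid range for profile 0: [1]..[8192].)\n"
)

for tensor, dim, low, high in summarize_profile_errors(sample):
    print(f"{tensor}: requested {dim}, allowed {low}..{high}")
```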
additional notes
NA