NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Engines built with v0.11.0 and run under Triton Server show lower concurrency than engines built with v0.10.0, even though the build scripts are the same. #2117

Open white-wolf-tech opened 1 month ago

white-wolf-tech commented 1 month ago

System Info

OS: Ubuntu 22.04
GPU: A100
Driver: 550.90.07
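
For reproducibility, the same information can be gathered with standard commands (shown here as a sketch):

    # Sketch: capture the system info reported above with standard tools.
    lsb_release -d                                                     # OS release
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader   # GPU, driver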

Who can help?

No response

Reproduction

The following is the v0.11.0 build script:

python3 hf_convert_trtllm.py --model_dir $input_dir \
    --output_dir $input_temp_dir \
    --dtype float16 \
    --calib_dataset $calib_dataset_path

trtllm-build --checkpoint_dir $input_temp_dir \
    --output_dir $output_dir \
    --remove_input_padding enable \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --use_fused_mlp \
    --context_fmha enable \
    --context_fmha_fp32_acc enable \
    --multi_block_mode enable \
    --nccl_plugin disable \
    --paged_state disable \
    --tokens_per_block 16 \
    --use_custom_all_reduce disable

The v0.10.0 build script is:

python3 hf_convert_trtllm.py --model_dir $input_dir \
    --output_dir $input_temp_dir \
    --dtype float16 \
    --calib_dataset $calib_dataset_path

trtllm-build --checkpoint_dir $input_temp_dir \
    --output_dir $output_dir \
    --remove_input_padding enable \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --max_batch_size 128 \
    --max_input_len 2048 \
    --max_num_tokens 32768 \
    --tokens_per_block 16 \
    --use_fused_mlp \
    --context_fmha enable \
    --context_fmha_fp32_acc enable \
    --multi_block_mode enable \
    --use_custom_all_reduce disable \
    --strongly_typed
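
Comparing the two invocations, the v0.11.0 build drops the explicit capacity limits (--max_batch_size 128, --max_input_len 2048, --max_num_tokens 32768) and --strongly_typed, so the v0.11.0 engine is built with whatever defaults that release ships. As a sanity check, a minimal sketch (assuming trtllm-build in v0.11.0 still accepts these flag names) is to re-add the same limits so both engines are constrained identically:

    # Sketch: v0.11.0 build with the explicit capacity limits from the
    # v0.10.0 script re-added, to rule out changed defaults as the cause.
    # Assumes trtllm-build v0.11.0 accepts the same flag names.
    trtllm-build --checkpoint_dir $input_temp_dir \
        --output_dir $output_dir \
        --remove_input_padding enable \
        --gemm_plugin float16 \
        --gpt_attention_plugin float16 \
        --paged_kv_cache enable \
        --use_paged_context_fmha enable \
        --use_fused_mlp \
        --context_fmha enable \
        --context_fmha_fp32_acc enable \
        --multi_block_mode enable \
        --nccl_plugin disable \
        --paged_state disable \
        --tokens_per_block 16 \
        --use_custom_all_reduce disable \
        --max_batch_size 128 \
        --max_input_len 2048 \
        --max_num_tokens 32768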

After building with v0.11.0, GPU utilization is only 52%-80%, while with v0.10.0 it holds steady at 99%. As a result, the concurrency achievable with v0.11.0 is only about half that of v0.10.0. What could be causing this? Has anyone else encountered it?
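
One way to quantify the gap (a sketch using standard NVIDIA tooling; the log file name is arbitrary) is to record per-second GPU utilization while driving the Triton endpoint with the same concurrent load against each engine, then compare the traces:

    # Sketch: log per-second SM/memory utilization with timestamps while the
    # load test runs against one engine; repeat for the other and compare.
    nvidia-smi dmon -s u -d 1 -o T > gpu_util_v0.11.0.log &
    DMON_PID=$!

    # ... drive the Triton Server endpoint with the concurrent workload here ...

    kill $DMON_PID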

Expected behavior

Same concurrency and GPU utilization as the v0.10.0 version.

actual behavior

Slower than the v0.10.0 version.

additional notes

None.

white-wolf-tech commented 1 week ago

The model I'm using is Qwen1.5-4B. I hit the same problem with v0.12.0. What could be the reason for this?

white-wolf-tech commented 1 week ago

Could this be related to the driver? What driver configuration do you use for your release builds? @kaiyux