NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Engines built with v0.11.0 and run under Triton Server show lower concurrency than engines built with v0.10.0, even though the build scripts are the same. #2117

Open white-wolf-tech opened 1 month ago

white-wolf-tech commented 1 month ago

System Info

OS: Ubuntu 22.04
GPU: A100
Driver: 550.90.07
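
For reproducibility, the same information can be gathered with standard commands (shown here as a sketch):

    # Sketch: capture the system info reported above with standard tools.
    lsb_release -d                                                     # OS release
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader   # GPU, driver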

Who can help?

No response

Reproduction

The following is the v0.11.0 build script:

python3 hf_convert_trtllm.py --model_dir $input_dir \
    --output_dir $input_temp_dir \
    --dtype float16 \
    --calib_dataset $calib_dataset_path

trtllm-build --checkpoint_dir $input_temp_dir \
    --output_dir $output_dir \
    --remove_input_padding enable \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --use_fused_mlp \
    --context_fmha enable \
    --context_fmha_fp32_acc enable \
    --multi_block_mode enable \
    --nccl_plugin disable \
    --paged_state disable \
    --tokens_per_block 16 \
    --use_custom_all_reduce disable

The v0.10.0 build script is:

python3 hf_convert_trtllm.py --model_dir $input_dir \
    --output_dir $input_temp_dir \
    --dtype float16 \
    --calib_dataset $calib_dataset_path

trtllm-build --checkpoint_dir $input_temp_dir \
    --output_dir $output_dir \
    --remove_input_padding enable \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --max_batch_size 128 \
    --max_input_len 2048 \
    --max_num_tokens 32768 \
    --tokens_per_block 16 \
    --use_fused_mlp \
    --context_fmha enable \
    --context_fmha_fp32_acc enable \
    --multi_block_mode enable \
    --use_custom_all_reduce disable \
    --strongly_typed
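
Comparing the two invocations, the v0.11.0 build drops the explicit capacity limits (--max_batch_size 128, --max_input_len 2048, --max_num_tokens 32768) and --strongly_typed, so the v0.11.0 engine is built with whatever defaults that release ships. As a sanity check, a minimal sketch (assuming trtllm-build in v0.11.0 still accepts these flag names) is to re-add the same limits so both engines are constrained identically:

    # Sketch: v0.11.0 build with the explicit capacity limits from the
    # v0.10.0 script re-added, to rule out changed defaults as the cause.
    # Assumes trtllm-build v0.11.0 accepts the same flag names.
    trtllm-build --checkpoint_dir $input_temp_dir \
        --output_dir $output_dir \
        --remove_input_padding enable \
        --gemm_plugin float16 \
        --gpt_attention_plugin float16 \
        --paged_kv_cache enable \
        --use_paged_context_fmha enable \
        --use_fused_mlp \
        --context_fmha enable \
        --context_fmha_fp32_acc enable \
        --multi_block_mode enable \
        --nccl_plugin disable \
        --paged_state disable \
        --tokens_per_block 16 \
        --use_custom_all_reduce disable \
        --max_batch_size 128 \
        --max_input_len 2048 \
        --max_num_tokens 32768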

After building with v0.11.0, GPU utilization is only 52%-80%, while with v0.10.0 it holds steady at 99%. As a result, the concurrency achievable with v0.11.0 is only about half that of v0.10.0. What could be causing this? Has anyone else encountered it?
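
One way to quantify the gap (a sketch using standard NVIDIA tooling; the log file name is arbitrary) is to record per-second GPU utilization while driving the Triton endpoint with the same concurrent load against each engine, then compare the traces:

    # Sketch: log per-second SM/memory utilization with timestamps while the
    # load test runs against one engine; repeat for the other and compare.
    nvidia-smi dmon -s u -d 1 -o T > gpu_util_v0.11.0.log &
    DMON_PID=$!

    # ... drive the Triton Server endpoint with the concurrent workload here ...

    kill $DMON_PID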

Expected behavior

Same concurrency and GPU utilization as the v0.10.0 version.

actual behavior

Slower than the v0.10.0 version.

additional notes

None.

white-wolf-tech commented 1 week ago

The model I'm using is Qwen1.5-4B. I hit the same problem with v0.12.0. What could be the reason for this?

white-wolf-tech commented 1 week ago

Could this be related to the driver? What driver configuration do you use for your release builds? @kaiyux