NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
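
For context, the Python API mentioned in the blurb above can be exercised through the project's high-level `LLM` class. The snippet below is only a minimal sketch adapted from the LLM API quickstart; the TinyLlama model name is just a small example checkpoint, and parameter names may differ between releases.

```python
# Minimal sketch of the high-level TensorRT-LLM Python API (LLM API quickstart style).
# The model name below is just a small example checkpoint; any supported HF model works.
from tensorrt_llm import LLM, SamplingParams

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Builds (or loads) a TensorRT engine for the model, then runs generation on it.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```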

tritonserver is 40x slower than `TensorRT-LLM/examples/run.py` #2435

Closed · ShuaiShao93 closed this 1 week ago

ShuaiShao93 commented 1 week ago

### System Info

L4 GPU, Debian 11 OS, CUDA 12.4

### Who can help?

@kaiyux

### Reproduction

  1. git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
  2. pip3 install tensorrt_llm==0.14.0 --extra-index-url https://pypi.nvidia.com/
  3. git clone -b v0.14.0 https://github.com/NVIDIA/TensorRT-LLM.git
  4. python TensorRT-LLM/examples/quantization/quantize.py --model_dir ./Meta-Llama-3.1-8B-Instruct --dtype float16 --qformat int4_awq --batch_size 8 --awq_block_size 128 --output_dir ./tllm_checkpoint_1gpu_int4_awq --calib_size 32
  5. trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu --gpt_attention_plugin auto --gemm_plugin auto --max_num_tokens 32768 --max_batch_size 8 --logits_dtype=float32 --gather_generation_logits --kv_cache_type=disabled
  6. python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file long_input.txt (long_input.txt is a prompt with 16k tokens)
  7. The latency is 0.27s on the L4 GPU
  8. git clone -b r24.10 https://github.com/triton-inference-server/tensorrtllm_backend.git
  9. cp ./tmp/llama/8B/trt_engines/int4_awq/1-gpu/* tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/
  10. Fill in the Triton model repository config templates:

      HF_LLAMA_MODEL=./Meta-Llama-3.1-8B-Instruct
      ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
      BATCH_SIZE=8

      cd tensorrtllm_backend
      python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:${BATCH_SIZE},preprocessing_instance_count:1
      python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:${BATCH_SIZE},postprocessing_instance_count:1
      python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
      python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:${BATCH_SIZE}
      python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},exclude_input_in_output:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0


  11. cd ..
  12. docker run -it --rm --gpus all --network host --shm-size=1g \
      -v $(pwd):/workspace \
      --workdir /workspace \
      nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
  13. python3 tensorrtllm_backend/scripts/launch_triton_server.py --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1
  14. Send the same long_input.txt to the `ensemble` model, again with max_output_len set to 1 (see the client sketch after this list)
  15. The latency is 11s
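
For step 14, the request against the `ensemble` model can be issued with a short client like the sketch below. It is only a sketch: it assumes Triton's default HTTP port 8000 and the `generate` endpoint exposed by the tensorrtllm_backend ensemble; the field names (`text_input`, `max_tokens`, `text_output`) follow that backend's documentation and may differ across versions.

```python
# Sketch of the Triton request used for the latency number in step 15.
# Assumes the server from step 13 is listening on the default HTTP port 8000
# and that the ensemble model accepts the tensorrtllm_backend "generate" schema.
import time

import requests

with open("long_input.txt") as f:
    prompt = f.read()  # the ~16k-token prompt from the repro

payload = {
    "text_input": prompt,
    "max_tokens": 1,   # mirrors --max_output_len 1 used with run.py
    "bad_words": "",
    "stop_words": "",
}

start = time.perf_counter()
resp = requests.post("http://localhost:8000/v2/models/ensemble/generate",
                     json=payload, timeout=300)
latency = time.perf_counter() - start
resp.raise_for_status()

print(f"latency: {latency:.2f}s")
print(resp.json().get("text_output", ""))
```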

### Expected behavior

The latency should be comparable between the two setups.

### Actual behavior

run.py takes 0.27s, while the Triton ensemble takes 11s.

### Additional notes

N/A
ShuaiShao93 commented 1 week ago

Damn, there is a `--max_input_length` flag in `TensorRT-LLM/examples/run.py`.
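
In other words, run.py presumably truncated the 16k-token prompt to `--max_input_length` tokens before running the engine, so the 0.27s figure was not measured on the full input. A quick way to see how much of the prompt survives that truncation is a sketch like the one below; it assumes the Hugging Face tokenizer from the cloned checkpoint, and the value 923 is only the default I recall from the examples, so check the argument default in the release you are using.

```python
# Sketch: check how many tokens of long_input.txt would survive run.py-style
# truncation to --max_input_length. MAX_INPUT_LENGTH below is an assumed default;
# verify it against run.py in your TensorRT-LLM checkout.
from transformers import AutoTokenizer

MAX_INPUT_LENGTH = 923  # assumed example default; pass a larger value to run.py to keep the full prompt

tokenizer = AutoTokenizer.from_pretrained("./Meta-Llama-3.1-8B-Instruct")

with open("long_input.txt") as f:
    prompt = f.read()

input_ids = tokenizer.encode(prompt)
print(f"full prompt:      {len(input_ids)} tokens")
print(f"after truncation: {min(len(input_ids), MAX_INPUT_LENGTH)} tokens")
```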