TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
11. cd ..
12. docker run -it --rm --gpus all --network host --shm-size=1g \
-v $(pwd):/workspace \
--workdir /workspace \
nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
13. python3 tensorrtllm_backend/scripts/launch_triton_server.py --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1
14. Use the same long_input.txt on `ensemble`, also set max_output_len to 1
15. The latency is 11s
### Expected behavior
The latency should be close between the two
### actual behavior
One is 0.27s, one is 11s
### additional notes
N/A
System Info
L4 GPU, Debian 11 OS, CUDA 12.4
Who can help?
@kaiyux
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
cd tensorrtllm_backend python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:${BATCH_SIZE},preprocessing_instance_count:1 python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:${BATCH_SIZE},postprocessing_instance_count:1 python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:${BATCH_SIZE} python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},exclude_input_in_output:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0