NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
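
For context, the Python API mentioned in the blurb above can be exercised through the project's high-level `LLM` class. The snippet below is only a minimal sketch adapted from the LLM API quickstart; the TinyLlama model name is just a small example checkpoint, and parameter names may differ between releases.

```python
# Minimal sketch of the high-level TensorRT-LLM Python API (LLM API quickstart style).
# The model name below is just a small example checkpoint; any supported HF model works.
from tensorrt_llm import LLM, SamplingParams

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Builds (or loads) a TensorRT engine for the model, then runs generation on it.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```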

tritonserver is 40x slower than `TensorRT-LLM/examples/run.py` #2435

Closed · ShuaiShao93 closed this 1 week ago

ShuaiShao93 commented 1 week ago

### System Info

L4 GPU, Debian 11 OS, CUDA 12.4

### Who can help?

@kaiyux

### Reproduction

  1. git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
  2. pip3 install tensorrt_llm==0.14.0 --extra-index-url https://pypi.nvidia.com/
  3. git clone -b v0.14.0 https://github.com/NVIDIA/TensorRT-LLM.git
  4. python TensorRT-LLM/examples/quantization/quantize.py --model_dir ./Meta-Llama-3.1-8B-Instruct --dtype float16 --qformat int4_awq --batch_size 8 --awq_block_size 128 --output_dir ./tllm_checkpoint_1gpu_int4_awq --calib_size 32
  5. trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu --gpt_attention_plugin auto --gemm_plugin auto --max_num_tokens 32768 --max_batch_size 8 --logits_dtype=float32 --gather_generation_logits --kv_cache_type=disabled
  6. python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --run_profiling --tokenizer_dir ./Meta-Llama-3.1-8B-Instruct --input_file long_input.txt (long_input.txt is a prompt with 16k tokens)
  7. The latency is 0.27s on the L4 GPU
  8. git clone -b r24.10 https://github.com/triton-inference-server/tensorrtllm_backend.git
  9. cp ./tmp/llama/8B/trt_engines/int4_awq/1-gpu/* tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/
  10. Fill in the Triton model repository config templates:

      HF_LLAMA_MODEL=./Meta-Llama-3.1-8B-Instruct
      ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
      BATCH_SIZE=8

      cd tensorrtllm_backend
      python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:${BATCH_SIZE},preprocessing_instance_count:1
      python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:${BATCH_SIZE},postprocessing_instance_count:1
      python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
      python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:${BATCH_SIZE}
      python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},exclude_input_in_output:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0


  11. cd ..
  12. docker run -it --rm --gpus all --network host --shm-size=1g \
      -v $(pwd):/workspace \
      --workdir /workspace \
      nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
  13. python3 tensorrtllm_backend/scripts/launch_triton_server.py --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1
  14. Send the same long_input.txt to the `ensemble` model, again with max_output_len set to 1 (see the client sketch after this list)
  15. The latency is 11s
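
For step 14, the request against the `ensemble` model can be issued with a short client like the sketch below. It is only a sketch: it assumes Triton's default HTTP port 8000 and the `generate` endpoint exposed by the tensorrtllm_backend ensemble; the field names (`text_input`, `max_tokens`, `text_output`) follow that backend's documentation and may differ across versions.

```python
# Sketch of the Triton request used for the latency number in step 15.
# Assumes the server from step 13 is listening on the default HTTP port 8000
# and that the ensemble model accepts the tensorrtllm_backend "generate" schema.
import time

import requests

with open("long_input.txt") as f:
    prompt = f.read()  # the ~16k-token prompt from the repro

payload = {
    "text_input": prompt,
    "max_tokens": 1,   # mirrors --max_output_len 1 used with run.py
    "bad_words": "",
    "stop_words": "",
}

start = time.perf_counter()
resp = requests.post("http://localhost:8000/v2/models/ensemble/generate",
                     json=payload, timeout=300)
latency = time.perf_counter() - start
resp.raise_for_status()

print(f"latency: {latency:.2f}s")
print(resp.json().get("text_output", ""))
```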

### Expected behavior

The latency should be comparable between the two setups.

### Actual behavior

run.py takes 0.27s, while the Triton ensemble takes 11s.

### Additional notes

N/A
ShuaiShao93 commented 1 week ago

Damn, there is a `--max_input_length` flag in `TensorRT-LLM/examples/run.py`.
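
In other words, run.py presumably truncated the 16k-token prompt to `--max_input_length` tokens before running the engine, so the 0.27s figure was not measured on the full input. A quick way to see how much of the prompt survives that truncation is a sketch like the one below; it assumes the Hugging Face tokenizer from the cloned checkpoint, and the value 923 is only the default I recall from the examples, so check the argument default in the release you are using.

```python
# Sketch: check how many tokens of long_input.txt would survive run.py-style
# truncation to --max_input_length. MAX_INPUT_LENGTH below is an assumed default;
# verify it against run.py in your TensorRT-LLM checkout.
from transformers import AutoTokenizer

MAX_INPUT_LENGTH = 923  # assumed example default; pass a larger value to run.py to keep the full prompt

tokenizer = AutoTokenizer.from_pretrained("./Meta-Llama-3.1-8B-Instruct")

with open("long_input.txt") as f:
    prompt = f.read()

input_ids = tokenizer.encode(prompt)
print(f"full prompt:      {len(input_ids)} tokens")
print(f"after truncation: {min(len(input_ids), MAX_INPUT_LENGTH)} tokens")
```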