NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

problem with tensorrt_llm performance #1938

Open Arnold1 opened 2 months ago

Arnold1 commented 2 months ago

System Info

Hi,

I generated a TensorRT-LLM engine for a Llama-based model and I'm seeing much worse performance than with vLLM.

I did the following:

Question: why does TensorRT-LLM throughput stay flat as concurrency increases while vLLM scales (see the benchmarks below)?

Setup:

Image used to compile the engine and run Triton Inference Server: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
TensorRT-LLM version: 0.10.0 (included in the image above)
GPU: 1 x NVIDIA A10
GPU memory: 24 GB
LLM: Meta-Llama-Guard-2-8B

GPU used:

nvidia-smi
Thu Jul 11 23:51:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   49C    P0              70W / 300W |  16834MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     74156      C   tritonserver                              16824MiB |
+---------------------------------------------------------------------------------------+

Build the TensorRT-LLM engine and create the Triton model repo: create_trt_engine.txt
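
For context, a minimal sketch of a typical TensorRT-LLM 0.10 Llama engine build is shown below. This is not the exact content of create_trt_engine.txt; the model paths, dtype, and batch/sequence limits are illustrative assumptions.

# Sketch only: typical TensorRT-LLM 0.10 build flow for a Llama checkpoint.
# Paths and limits are placeholders, not the values from create_trt_engine.txt.

# 1) Convert the Hugging Face checkpoint to a TensorRT-LLM checkpoint
python3 examples/llama/convert_checkpoint.py \
    --model_dir /models/Meta-Llama-Guard-2-8B \
    --output_dir /tmp/trtllm_ckpt \
    --dtype float16

# 2) Build the engine; max_batch_size and paged KV cache matter for concurrency
trtllm-build \
    --checkpoint_dir /tmp/trtllm_ckpt \
    --output_dir /engines/llama_guard_2_8b \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_output_len 1024

The --max_batch_size chosen at build time is an upper bound on how many requests the runtime can batch together, so a small value here limits concurrent throughput regardless of the server configuration.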

Start Triton Inference Server (launch command and model configs): start_triton_inference.txt
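
Only as a sketch (the real launch command and config.pbtxt files are in start_triton_inference.txt; the repository path below is a placeholder), the server is started roughly like this:

# Sketch: launch Triton with a tensorrtllm_backend model repository
# (the usual ensemble / preprocessing / postprocessing / tensorrt_llm models).
tritonserver \
    --model-repository=/triton_model_repo \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002

The batching behaviour lives in the tensorrt_llm model's config.pbtxt (part of start_triton_inference.txt), e.g. its max_batch_size and the gpt_model_type parameter.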

Benchmark of Triton + TensorRT-LLM:

2024/07/11 23:37:56 ============ Serving Benchmark Result ============
2024/07/11 23:37:56 Benchmark Duration (sec): 120.10
2024/07/11 23:37:56 Number of total requests: 362
2024/07/11 23:37:56 Success Rate (Percent): 100.00
2024/07/11 23:37:56 Concurrency: 1
2024/07/11 23:37:56 Request throughput (req/sec): 3.014
2024/07/11 23:37:56 Prompt throughput (tokens/second) avg: 2224.493
2024/07/11 23:37:56 Generation throughput (tokens/second) avg: 12.057
2024/07/11 23:37:56 End to End Latency (ms) avg: 328.207
2024/07/11 23:37:56 End to End Latency (ms) p50: 328.000
2024/07/11 23:37:56 End to End Latency (ms) p90: 329.000

2024/07/11 23:37:56 Running load test with concurrency 5...
2024/07/11 23:39:57 ============ Serving Benchmark Result ============
2024/07/11 23:39:57 Benchmark Duration (sec): 121.45
2024/07/11 23:39:57 Number of total requests: 375
2024/07/11 23:39:57 Success Rate (Percent): 100.00
2024/07/11 23:39:57 Concurrency: 5
2024/07/11 23:39:57 Request throughput (req/sec): 3.088
2024/07/11 23:39:57 Prompt throughput (tokens/second) avg: 2278.757
2024/07/11 23:39:57 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:39:57 End to End Latency (ms) avg: 1607.147
2024/07/11 23:39:57 End to End Latency (ms) p50: 1616.000
2024/07/11 23:39:57 End to End Latency (ms) p90: 1617.000

2024/07/11 23:39:57 Running load test with concurrency 10...
2024/07/11 23:42:00 ============ Serving Benchmark Result ============
2024/07/11 23:42:00 Benchmark Duration (sec): 123.06
2024/07/11 23:42:00 Number of total requests: 380
2024/07/11 23:42:00 Success Rate (Percent): 100.00
2024/07/11 23:42:00 Concurrency: 10
2024/07/11 23:42:00 Request throughput (req/sec): 3.088
2024/07/11 23:42:00 Prompt throughput (tokens/second) avg: 2278.813
2024/07/11 23:42:00 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:42:00 End to End Latency (ms) avg: 3196.500
2024/07/11 23:42:00 End to End Latency (ms) p50: 3235.000
2024/07/11 23:42:00 End to End Latency (ms) p90: 3236.000

2024/07/11 23:42:00 Running load test with concurrency 20...
2024/07/11 23:44:07 ============ Serving Benchmark Result ============
2024/07/11 23:44:07 Benchmark Duration (sec): 126.30
2024/07/11 23:44:07 Number of total requests: 390
2024/07/11 23:44:07 Success Rate (Percent): 100.00
2024/07/11 23:44:07 Concurrency: 20
2024/07/11 23:44:07 Request throughput (req/sec): 3.088
2024/07/11 23:44:07 Prompt throughput (tokens/second) avg: 2278.796
2024/07/11 23:44:07 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:44:07 End to End Latency (ms) avg: 6315.615
2024/07/11 23:44:07 End to End Latency (ms) p50: 6473.000
2024/07/11 23:44:07 End to End Latency (ms) p90: 6474.000

2024/07/11 23:44:07 Running load test with concurrency 30...
2024/07/11 23:46:16 ============ Serving Benchmark Result ============
2024/07/11 23:46:16 Benchmark Duration (sec): 129.54
2024/07/11 23:46:16 Number of total requests: 400
2024/07/11 23:46:16 Success Rate (Percent): 100.00
2024/07/11 23:46:16 Concurrency: 30
2024/07/11 23:46:16 Request throughput (req/sec): 3.088
2024/07/11 23:46:16 Prompt throughput (tokens/second) avg: 2278.771
2024/07/11 23:46:16 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:46:16 End to End Latency (ms) avg: 9359.320
2024/07/11 23:46:16 End to End Latency (ms) p50: 9712.000
2024/07/11 23:46:16 End to End Latency (ms) p90: 9713.000

2024/07/11 23:46:16 Running load test with concurrency 40...
2024/07/11 23:48:29 ============ Serving Benchmark Result ============
2024/07/11 23:48:29 Benchmark Duration (sec): 132.79
2024/07/11 23:48:29 Number of total requests: 410
2024/07/11 23:48:29 Success Rate (Percent): 100.00
2024/07/11 23:48:29 Concurrency: 40
2024/07/11 23:48:29 Request throughput (req/sec): 3.088
2024/07/11 23:48:29 Prompt throughput (tokens/second) avg: 2278.700
2024/07/11 23:48:29 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:48:29 End to End Latency (ms) avg: 12334.346
2024/07/11 23:48:29 End to End Latency (ms) p50: 12950.000
2024/07/11 23:48:29 End to End Latency (ms) p90: 12951.000

2024/07/11 23:48:29 Running load test with concurrency 50...
2024/07/11 23:50:45 ============ Serving Benchmark Result ============
2024/07/11 23:50:45 Benchmark Duration (sec): 136.02
2024/07/11 23:50:45 Number of total requests: 420
2024/07/11 23:50:45 Success Rate (Percent): 100.00
2024/07/11 23:50:45 Concurrency: 50
2024/07/11 23:50:45 Request throughput (req/sec): 3.088
2024/07/11 23:50:45 Prompt throughput (tokens/second) avg: 2278.776
2024/07/11 23:50:45 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:50:45 End to End Latency (ms) avg: 15243.845
2024/07/11 23:50:45 End to End Latency (ms) p50: 16188.000
2024/07/11 23:50:45 End to End Latency (ms) p90: 16189.000
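
(The load-generation tool used for these numbers is not shown. For a single sanity-check request against the same Triton HTTP endpoint, the tensorrtllm_backend generate route can be used as sketched below; the prompt and parameters are illustrative and assume the default ensemble model name.)

# Sketch: one request against the default tensorrtllm_backend ensemble
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "Hello, how are you?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'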

Deploy the vLLM container:

#!/bin/bash

MODEL="meta-llama/Meta-Llama-Guard-2-8B"
HUGGING_FACE_HUB_TOKEN="xxx"

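# vLLM serving settings: let vLLM use up to 90% of GPU memory (weights + KV
# cache), cap the context length at 8192 tokens, keep the KV cache in the
# model dtype, enable automatic prefix caching, and allow up to 8192 batched
# tokens per scheduling step.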
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
    -p 8080:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model ${MODEL} \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --kv-cache-dtype auto \
    --enable-prefix-caching \
    --max-num-batched-tokens 8192

Start the vLLM container:

./deploy.sh 
Unable to find image 'vllm/vllm-openai:latest' locally
latest: Pulling from vllm/vllm-openai
3c645031de29: Pull complete 
0d6448aff889: Pull complete 
0a7674e3e8fe: Pull complete 
b71b637b97c5: Pull complete 
56dc85502937: Pull complete 
380ca03515b9: Pull complete 
b9e353cd3958: Pull complete 
57efca880186: Pull complete 
2735a04f6870: Pull complete 
175b4b06144d: Pull complete 
5dc5ca7a92cf: Pull complete 
203e66f482bf: Pull complete 
Digest: sha256:e58fceffa6f8d3e4d535f9e7128361cd33469b232a8dc670967b62ae62bac5fe
Status: Downloaded newer image for vllm/vllm-openai:latest
INFO 07-12 00:08:12 api_server.py:206] vLLM API server version 0.5.1
INFO 07-12 00:08:12 api_server.py:207] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-Guard-2-8B', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=8192, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-12 00:08:12 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='meta-llama/Meta-Llama-Guard-2-8B', speculative_config=None, tokenizer='meta-llama/Meta-Llama-Guard-2-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Meta-Llama-Guard-2-8B, use_v2_block_manager=False, enable_prefix_caching=True)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-12 00:08:13 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-12 00:10:15 model_runner.py:255] Loading model weights took 14.9595 GB
INFO 07-12 00:10:17 gpu_executor.py:84] # GPU blocks: 1639, # CPU blocks: 2048
INFO 07-12 00:10:18 model_runner.py:924] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-12 00:10:18 model_runner.py:928] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-12 00:10:31 model_runner.py:1117] Graph capturing finished in 12 secs.
INFO 07-12 00:10:31 block_manager_v1.py:247] Automatic prefix caching is enabled.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-12 00:10:31 serving_chat.py:94] Using default chat template:
INFO 07-12 00:10:31 serving_chat.py:94] {% if messages|length % 2 == 0 %}{% set role = 'Agent' %}{% else %}{% set role = 'User' %}{% endif %}{{ "[INST] Task: Check if there is unsafe content in '" + role + "' messages in conversations according our safety policy with the below categories.
INFO 07-12 00:10:31 serving_chat.py:94] 
INFO 07-12 00:10:31 serving_chat.py:94] <BEGIN UNSAFE CONTENT CATEGORIES>
INFO 07-12 00:10:31 serving_chat.py:94] S1: Violent Crimes. 
INFO 07-12 00:10:31 serving_chat.py:94] S2: Non-Violent Crimes. 
INFO 07-12 00:10:31 serving_chat.py:94] S3: Sex Crimes. 
INFO 07-12 00:10:31 serving_chat.py:94] S4: Child Exploitation. 
INFO 07-12 00:10:31 serving_chat.py:94] S5: Specialized Advice. 
INFO 07-12 00:10:31 serving_chat.py:94] S6: Privacy. 
INFO 07-12 00:10:31 serving_chat.py:94] S7: Intellectual Property. 
INFO 07-12 00:10:31 serving_chat.py:94] S8: Indiscriminate Weapons. 
INFO 07-12 00:10:31 serving_chat.py:94] S9: Hate. 
INFO 07-12 00:10:31 serving_chat.py:94] S10: Self-Harm. 
INFO 07-12 00:10:31 serving_chat.py:94] S11: Sexual Content. 
INFO 07-12 00:10:31 serving_chat.py:94] <END UNSAFE CONTENT CATEGORIES>
INFO 07-12 00:10:31 serving_chat.py:94] 
INFO 07-12 00:10:31 serving_chat.py:94] <BEGIN CONVERSATION>
INFO 07-12 00:10:31 serving_chat.py:94] 
INFO 07-12 00:10:31 serving_chat.py:94] " }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{% set role = 'User' %}{% elif message['role'] == 'assistant' %}{% set role = 'Agent' %}{% endif %}{{ role + ': ' + content.strip() + '
INFO 07-12 00:10:31 serving_chat.py:94] 
INFO 07-12 00:10:31 serving_chat.py:94] ' }}{% endfor %}{{ "<END CONVERSATION>
INFO 07-12 00:10:31 serving_chat.py:94] 
INFO 07-12 00:10:31 serving_chat.py:94] Provide your safety assessment for ONLY THE LAST " + role + " message in the above conversation:
INFO 07-12 00:10:31 serving_chat.py:94]  - First line must read 'safe' or 'unsafe'.
INFO 07-12 00:10:31 serving_chat.py:94]  - If unsafe, a second line must include a comma-separated list of violated categories. [/INST]" }}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-12 00:10:32 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
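
Before benchmarking, the vLLM server can be smoke-tested via its OpenAI-compatible completions endpoint; note that deploy.sh maps container port 8000 to host port 8080. The prompt and max_tokens below are illustrative.

# Sketch: smoke-test the vLLM OpenAI-compatible server started by deploy.sh
curl -s http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Meta-Llama-Guard-2-8B", "prompt": "Hello, how are you?", "max_tokens": 64}'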

Benchmark of vLLM:

2024/07/12 02:23:54 Running load test with concurrency 1...
2024/07/12 02:25:54 ============ Serving Benchmark Result ============
2024/07/12 02:25:54 Benchmark Duration (sec): 120.18
2024/07/12 02:25:54 Number of total requests: 656
2024/07/12 02:25:54 Success Rate (Percent): 100.00
2024/07/12 02:25:54 Concurrency: 1
2024/07/12 02:25:54 Request throughput (req/sec): 5.459
2024/07/12 02:25:54 Prompt throughput (tokens/second) avg: 32.751
2024/07/12 02:25:54 Generation throughput (tokens/second) avg: 21.834
2024/07/12 02:25:54 End to End Latency (ms) avg: 182.445
2024/07/12 02:25:54 End to End Latency (ms) p50: 182.000
2024/07/12 02:25:54 End to End Latency (ms) p90: 183.000

2024/07/12 02:25:54 Running load test with concurrency 5...
2024/07/12 02:27:55 ============ Serving Benchmark Result ============
2024/07/12 02:27:55 Benchmark Duration (sec): 120.07
2024/07/12 02:27:55 Number of total requests: 2425
2024/07/12 02:27:55 Success Rate (Percent): 100.00
2024/07/12 02:27:55 Concurrency: 5
2024/07/12 02:27:55 Request throughput (req/sec): 20.196
2024/07/12 02:27:55 Prompt throughput (tokens/second) avg: 121.179
2024/07/12 02:27:55 Generation throughput (tokens/second) avg: 80.786
2024/07/12 02:27:55 End to End Latency (ms) avg: 246.920
2024/07/12 02:27:55 End to End Latency (ms) p50: 247.000
2024/07/12 02:27:55 End to End Latency (ms) p90: 250.000

2024/07/12 02:27:55 Running load test with concurrency 10...
2024/07/12 02:29:55 ============ Serving Benchmark Result ============
2024/07/12 02:29:55 Benchmark Duration (sec): 120.24
2024/07/12 02:29:55 Number of total requests: 4340
2024/07/12 02:29:55 Success Rate (Percent): 100.00
2024/07/12 02:29:55 Concurrency: 10
2024/07/12 02:29:55 Request throughput (req/sec): 36.096
2024/07/12 02:29:55 Prompt throughput (tokens/second) avg: 216.573
2024/07/12 02:29:55 Generation throughput (tokens/second) avg: 144.382
2024/07/12 02:29:55 End to End Latency (ms) avg: 276.402
2024/07/12 02:29:55 End to End Latency (ms) p50: 275.000
2024/07/12 02:29:55 End to End Latency (ms) p90: 282.000

2024/07/12 02:29:55 Running load test with concurrency 20...
2024/07/12 02:31:55 ============ Serving Benchmark Result ============
2024/07/12 02:31:55 Benchmark Duration (sec): 120.01
2024/07/12 02:31:55 Number of total requests: 5760
2024/07/12 02:31:55 Success Rate (Percent): 100.00
2024/07/12 02:31:55 Concurrency: 20
2024/07/12 02:31:55 Request throughput (req/sec): 47.998
2024/07/12 02:31:55 Prompt throughput (tokens/second) avg: 287.985
2024/07/12 02:31:55 Generation throughput (tokens/second) avg: 191.990
2024/07/12 02:31:55 End to End Latency (ms) avg: 416.056
2024/07/12 02:31:55 End to End Latency (ms) p50: 353.000
2024/07/12 02:31:55 End to End Latency (ms) p90: 668.000

2024/07/12 02:31:55 Running load test with concurrency 30...
2024/07/12 02:33:55 ============ Serving Benchmark Result ============
2024/07/12 02:33:55 Benchmark Duration (sec): 120.15
2024/07/12 02:33:55 Number of total requests: 7170
2024/07/12 02:33:55 Success Rate (Percent): 100.00
2024/07/12 02:33:55 Concurrency: 30
2024/07/12 02:33:55 Request throughput (req/sec): 59.675
2024/07/12 02:33:55 Prompt throughput (tokens/second) avg: 358.051
2024/07/12 02:33:55 Generation throughput (tokens/second) avg: 238.701
2024/07/12 02:33:55 End to End Latency (ms) avg: 502.089
2024/07/12 02:33:55 End to End Latency (ms) p50: 474.000
2024/07/12 02:33:55 End to End Latency (ms) p90: 502.000

2024/07/12 02:33:55 Running load test with concurrency 40...
2024/07/12 02:35:55 ============ Serving Benchmark Result ============
2024/07/12 02:35:55 Benchmark Duration (sec): 120.51
2024/07/12 02:35:55 Number of total requests: 8280
2024/07/12 02:35:55 Success Rate (Percent): 100.00
2024/07/12 02:35:55 Concurrency: 40
2024/07/12 02:35:55 Request throughput (req/sec): 68.707
2024/07/12 02:35:55 Prompt throughput (tokens/second) avg: 412.240
2024/07/12 02:35:55 Generation throughput (tokens/second) avg: 274.827
2024/07/12 02:35:55 End to End Latency (ms) avg: 581.544
2024/07/12 02:35:55 End to End Latency (ms) p50: 573.000
2024/07/12 02:35:55 End to End Latency (ms) p90: 586.000

2024/07/12 02:35:55 Running load test with concurrency 50...
2024/07/12 02:37:56 ============ Serving Benchmark Result ============
2024/07/12 02:37:56 Benchmark Duration (sec): 120.16
2024/07/12 02:37:56 Number of total requests: 8600
2024/07/12 02:37:56 Success Rate (Percent): 100.00
2024/07/12 02:37:56 Concurrency: 50
2024/07/12 02:37:56 Request throughput (req/sec): 71.573
2024/07/12 02:37:56 Prompt throughput (tokens/second) avg: 429.441
2024/07/12 02:37:56 Generation throughput (tokens/second) avg: 286.294
2024/07/12 02:37:56 End to End Latency (ms) avg: 697.933
2024/07/12 02:37:56 End to End Latency (ms) p50: 689.000
2024/07/12 02:37:56 End to End Latency (ms) p90: 707.000

Who can help?

@hijkzzz @Tracin @yuxianq @Njuapp @uppalutkarsh @nv-guomingz

Reproduction

All steps are in the commands, configs, and logs above.

Expected behavior

Better performance for concurrent requests, comparable to vLLM.

Actual behavior

Performance degradation: throughput stays flat (~3 req/s, ~12 generated tokens/s) and end-to-end latency grows roughly linearly with concurrency.

Additional notes

None.

QiJune commented 1 month ago

@kaiyux Could you please have a look? Thanks

sunnyqgg commented 1 month ago

Hi @Arnold1, how did you get the benchmark results for Triton inference and vLLM? Can you share your detailed steps so I can reproduce your results quickly and find the root cause of the gap?

ashwin-js commented 1 week ago

Hi @Arnold1, @sunnyqgg, were you able to figure out the root cause here? I'm observing a similar trend for a Llama-2-7B model, using the latest versions of both TRT-LLM and vLLM and their respective latest Triton servers.

sunnyqgg commented 1 week ago

Hi @ashwin-js, that's not expected. Can you share your steps and commands for both?