TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Apache License 2.0
problem with tensorrt_llm performance #1938

opened 2 months ago

commented 2 months ago

System Info


I generated a TensorRT-LLM engine for a Llama-based model and see that its performance is much worse than vLLM's.

I did the following:



Image used to build the engine and run Triton Inference Server: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
TensorRT-LLM version: 0.10.0 (included in the image above)
GPU name: 1 x Nvidia A10
GPU memory: 24 gigabytes (GB)
LLM: Meta-Llama-Guard-2-8B

GPU used (nvidia-smi output):

Thu Jul 11 23:51:14 2024       
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   49C    P0              70W / 300W |  16834MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |

| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|    0   N/A  N/A     74156      C   tritonserver                              16824MiB |

Build the TensorRT-LLM engine and create the Triton model repository: create_trt_engine.txt

Triton Inference Server startup and model configs: start_triton_inference.txt

Benchmark of Triton inference:

2024/07/11 23:37:56 ============ Serving Benchmark Result ============
2024/07/11 23:37:56 Benchmark Duration (sec): 120.10
2024/07/11 23:37:56 Number of total requests: 362
2024/07/11 23:37:56 Success Rate (Percent): 100.00
2024/07/11 23:37:56 Concurrency: 1
2024/07/11 23:37:56 Request throughput (req/sec): 3.014
2024/07/11 23:37:56 Prompt throughput (tokens/second) avg: 2224.493
2024/07/11 23:37:56 Generation throughput (tokens/second) avg: 12.057
2024/07/11 23:37:56 End to End Latency (ms) avg: 328.207
2024/07/11 23:37:56 End to End Latency (ms) p50: 328.000
2024/07/11 23:37:56 End to End Latency (ms) p90: 329.000

2024/07/11 23:37:56 Running load test with concurrency 5...
2024/07/11 23:39:57 ============ Serving Benchmark Result ============
2024/07/11 23:39:57 Benchmark Duration (sec): 121.45
2024/07/11 23:39:57 Number of total requests: 375
2024/07/11 23:39:57 Success Rate (Percent): 100.00
2024/07/11 23:39:57 Concurrency: 5
2024/07/11 23:39:57 Request throughput (req/sec): 3.088
2024/07/11 23:39:57 Prompt throughput (tokens/second) avg: 2278.757
2024/07/11 23:39:57 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:39:57 End to End Latency (ms) avg: 1607.147
2024/07/11 23:39:57 End to End Latency (ms) p50: 1616.000
2024/07/11 23:39:57 End to End Latency (ms) p90: 1617.000

2024/07/11 23:39:57 Running load test with concurrency 10...
2024/07/11 23:42:00 ============ Serving Benchmark Result ============
2024/07/11 23:42:00 Benchmark Duration (sec): 123.06
2024/07/11 23:42:00 Number of total requests: 380
2024/07/11 23:42:00 Success Rate (Percent): 100.00
2024/07/11 23:42:00 Concurrency: 10
2024/07/11 23:42:00 Request throughput (req/sec): 3.088
2024/07/11 23:42:00 Prompt throughput (tokens/second) avg: 2278.813
2024/07/11 23:42:00 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:42:00 End to End Latency (ms) avg: 3196.500
2024/07/11 23:42:00 End to End Latency (ms) p50: 3235.000
2024/07/11 23:42:00 End to End Latency (ms) p90: 3236.000

2024/07/11 23:42:00 Running load test with concurrency 20...
2024/07/11 23:44:07 ============ Serving Benchmark Result ============
2024/07/11 23:44:07 Benchmark Duration (sec): 126.30
2024/07/11 23:44:07 Number of total requests: 390
2024/07/11 23:44:07 Success Rate (Percent): 100.00
2024/07/11 23:44:07 Concurrency: 20
2024/07/11 23:44:07 Request throughput (req/sec): 3.088
2024/07/11 23:44:07 Prompt throughput (tokens/second) avg: 2278.796
2024/07/11 23:44:07 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:44:07 End to End Latency (ms) avg: 6315.615
2024/07/11 23:44:07 End to End Latency (ms) p50: 6473.000
2024/07/11 23:44:07 End to End Latency (ms) p90: 6474.000

2024/07/11 23:44:07 Running load test with concurrency 30...
2024/07/11 23:46:16 ============ Serving Benchmark Result ============
2024/07/11 23:46:16 Benchmark Duration (sec): 129.54
2024/07/11 23:46:16 Number of total requests: 400
2024/07/11 23:46:16 Success Rate (Percent): 100.00
2024/07/11 23:46:16 Concurrency: 30
2024/07/11 23:46:16 Request throughput (req/sec): 3.088
2024/07/11 23:46:16 Prompt throughput (tokens/second) avg: 2278.771
2024/07/11 23:46:16 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:46:16 End to End Latency (ms) avg: 9359.320
2024/07/11 23:46:16 End to End Latency (ms) p50: 9712.000
2024/07/11 23:46:16 End to End Latency (ms) p90: 9713.000

2024/07/11 23:46:16 Running load test with concurrency 40...
2024/07/11 23:48:29 ============ Serving Benchmark Result ============
2024/07/11 23:48:29 Benchmark Duration (sec): 132.79
2024/07/11 23:48:29 Number of total requests: 410
2024/07/11 23:48:29 Success Rate (Percent): 100.00
2024/07/11 23:48:29 Concurrency: 40
2024/07/11 23:48:29 Request throughput (req/sec): 3.088
2024/07/11 23:48:29 Prompt throughput (tokens/second) avg: 2278.700
2024/07/11 23:48:29 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:48:29 End to End Latency (ms) avg: 12334.346
2024/07/11 23:48:29 End to End Latency (ms) p50: 12950.000
2024/07/11 23:48:29 End to End Latency (ms) p90: 12951.000

2024/07/11 23:48:29 Running load test with concurrency 50...
2024/07/11 23:50:45 ============ Serving Benchmark Result ============
2024/07/11 23:50:45 Benchmark Duration (sec): 136.02
2024/07/11 23:50:45 Number of total requests: 420
2024/07/11 23:50:45 Success Rate (Percent): 100.00
2024/07/11 23:50:45 Concurrency: 50
2024/07/11 23:50:45 Request throughput (req/sec): 3.088
2024/07/11 23:50:45 Prompt throughput (tokens/second) avg: 2278.776
2024/07/11 23:50:45 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:50:45 End to End Latency (ms) avg: 15243.845
2024/07/11 23:50:45 End to End Latency (ms) p50: 16188.000
2024/07/11 23:50:45 End to End Latency (ms) p90: 16189.000
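One way to read the Triton numbers above is through Little's law (requests in flight ≈ throughput × latency). The sketch below uses values copied from the log; the helper names are hypothetical. Throughput stays near 3.09 req/s at every concurrency level while p50 latency grows in proportion, which is consistent with requests being served one at a time rather than batched. This is a diagnostic reading of the log, not a confirmed root cause.

```python
# (concurrency, request throughput in req/s, p50 end-to-end latency in ms),
# copied from the Triton benchmark log above.
TRITON_RUNS = [
    (1, 3.014, 328.0),
    (5, 3.088, 1616.0),
    (10, 3.088, 3235.0),
    (20, 3.088, 6473.0),
    (30, 3.088, 9712.0),
    (40, 3.088, 12950.0),
    (50, 3.088, 16188.0),
]


def requests_in_flight(throughput_rps, latency_ms):
    """Little's law: average requests in the system = throughput x time in system."""
    return throughput_rps * latency_ms / 1000.0


def service_time_ms(throughput_rps):
    """Time per request if the server handled exactly one request at a time."""
    return 1000.0 / throughput_rps


for concurrency, rps, p50_ms in TRITON_RUNS:
    in_flight = requests_in_flight(rps, p50_ms)
    print(
        f"concurrency={concurrency:>2}  "
        f"in_flight~{in_flight:5.2f}  "
        f"service_time~{service_time_ms(rps):6.1f} ms"
    )
```

If the server were batching concurrent requests, the implied per-request service time would drop as concurrency rises; here it stays at roughly 324 ms throughout.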

deploy vllm container:



docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8080:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model ${MODEL} \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --kv-cache-dtype auto \
    --enable-prefix-caching \
    --max-num-batched-tokens 8192
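For anyone trying to reproduce the vLLM side, a single request against the container started above could be sketched as follows. This is a minimal sketch under assumptions: the server is reachable on localhost at host port 8080 (from `-p 8080:8000`), the image exposes its OpenAI-compatible `/v1/completions` route, and the helper names and the `max_tokens` default are hypothetical. It is not the benchmark client used for the numbers in this issue.

```python
import json
import urllib.request

# Assumption: the container is running on this machine; 8080 is the host port
# mapped by `-p 8080:8000` in the docker command above.
VLLM_URL = "http://localhost:8080/v1/completions"


def build_completion_request(model, prompt, max_tokens=16):
    """Build (but do not send) a POST request for the OpenAI-compatible API."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        VLLM_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

The `model` argument should match the value passed as `--model ${MODEL}` when the container was started.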

start vllm container:

benchmark vllm:

2024/07/12 02:23:54 Running load test with concurrency 1...
2024/07/12 02:25:54 ============ Serving Benchmark Result ============
2024/07/12 02:25:54 Benchmark Duration (sec): 120.18
2024/07/12 02:25:54 Number of total requests: 656
2024/07/12 02:25:54 Success Rate (Percent): 100.00
2024/07/12 02:25:54 Concurrency: 1
2024/07/12 02:25:54 Request throughput (req/sec): 5.459
2024/07/12 02:25:54 Prompt throughput (tokens/second) avg: 32.751
2024/07/12 02:25:54 Generation throughput (tokens/second) avg: 21.834
2024/07/12 02:25:54 End to End Latency (ms) avg: 182.445
2024/07/12 02:25:54 End to End Latency (ms) p50: 182.000
2024/07/12 02:25:54 End to End Latency (ms) p90: 183.000

2024/07/12 02:25:54 Running load test with concurrency 5...
2024/07/12 02:27:55 ============ Serving Benchmark Result ============
2024/07/12 02:27:55 Benchmark Duration (sec): 120.07
2024/07/12 02:27:55 Number of total requests: 2425
2024/07/12 02:27:55 Success Rate (Percent): 100.00
2024/07/12 02:27:55 Concurrency: 5
2024/07/12 02:27:55 Request throughput (req/sec): 20.196
2024/07/12 02:27:55 Prompt throughput (tokens/second) avg: 121.179
2024/07/12 02:27:55 Generation throughput (tokens/second) avg: 80.786
2024/07/12 02:27:55 End to End Latency (ms) avg: 246.920
2024/07/12 02:27:55 End to End Latency (ms) p50: 247.000
2024/07/12 02:27:55 End to End Latency (ms) p90: 250.000

2024/07/12 02:27:55 Running load test with concurrency 10...
2024/07/12 02:29:55 ============ Serving Benchmark Result ============
2024/07/12 02:29:55 Benchmark Duration (sec): 120.24
2024/07/12 02:29:55 Number of total requests: 4340
2024/07/12 02:29:55 Success Rate (Percent): 100.00
2024/07/12 02:29:55 Concurrency: 10
2024/07/12 02:29:55 Request throughput (req/sec): 36.096
2024/07/12 02:29:55 Prompt throughput (tokens/second) avg: 216.573
2024/07/12 02:29:55 Generation throughput (tokens/second) avg: 144.382
2024/07/12 02:29:55 End to End Latency (ms) avg: 276.402
2024/07/12 02:29:55 End to End Latency (ms) p50: 275.000
2024/07/12 02:29:55 End to End Latency (ms) p90: 282.000

2024/07/12 02:29:55 Running load test with concurrency 20...
2024/07/12 02:31:55 ============ Serving Benchmark Result ============
2024/07/12 02:31:55 Benchmark Duration (sec): 120.01
2024/07/12 02:31:55 Number of total requests: 5760
2024/07/12 02:31:55 Success Rate (Percent): 100.00
2024/07/12 02:31:55 Concurrency: 20
2024/07/12 02:31:55 Request throughput (req/sec): 47.998
2024/07/12 02:31:55 Prompt throughput (tokens/second) avg: 287.985
2024/07/12 02:31:55 Generation throughput (tokens/second) avg: 191.990
2024/07/12 02:31:55 End to End Latency (ms) avg: 416.056
2024/07/12 02:31:55 End to End Latency (ms) p50: 353.000
2024/07/12 02:31:55 End to End Latency (ms) p90: 668.000

2024/07/12 02:31:55 Running load test with concurrency 30...
2024/07/12 02:33:55 ============ Serving Benchmark Result ============
2024/07/12 02:33:55 Benchmark Duration (sec): 120.15
2024/07/12 02:33:55 Number of total requests: 7170
2024/07/12 02:33:55 Success Rate (Percent): 100.00
2024/07/12 02:33:55 Concurrency: 30
2024/07/12 02:33:55 Request throughput (req/sec): 59.675
2024/07/12 02:33:55 Prompt throughput (tokens/second) avg: 358.051
2024/07/12 02:33:55 Generation throughput (tokens/second) avg: 238.701
2024/07/12 02:33:55 End to End Latency (ms) avg: 502.089
2024/07/12 02:33:55 End to End Latency (ms) p50: 474.000
2024/07/12 02:33:55 End to End Latency (ms) p90: 502.000

2024/07/12 02:33:55 Running load test with concurrency 40...
2024/07/12 02:35:55 ============ Serving Benchmark Result ============
2024/07/12 02:35:55 Benchmark Duration (sec): 120.51
2024/07/12 02:35:55 Number of total requests: 8280
2024/07/12 02:35:55 Success Rate (Percent): 100.00
2024/07/12 02:35:55 Concurrency: 40
2024/07/12 02:35:55 Request throughput (req/sec): 68.707
2024/07/12 02:35:55 Prompt throughput (tokens/second) avg: 412.240
2024/07/12 02:35:55 Generation throughput (tokens/second) avg: 274.827
2024/07/12 02:35:55 End to End Latency (ms) avg: 581.544
2024/07/12 02:35:55 End to End Latency (ms) p50: 573.000
2024/07/12 02:35:55 End to End Latency (ms) p90: 586.000

2024/07/12 02:35:55 Running load test with concurrency 50...
2024/07/12 02:37:56 ============ Serving Benchmark Result ============
2024/07/12 02:37:56 Benchmark Duration (sec): 120.16
2024/07/12 02:37:56 Number of total requests: 8600
2024/07/12 02:37:56 Success Rate (Percent): 100.00
2024/07/12 02:37:56 Concurrency: 50
2024/07/12 02:37:56 Request throughput (req/sec): 71.573
2024/07/12 02:37:56 Prompt throughput (tokens/second) avg: 429.441
2024/07/12 02:37:56 Generation throughput (tokens/second) avg: 286.294
2024/07/12 02:37:56 End to End Latency (ms) avg: 697.933
2024/07/12 02:37:56 End to End Latency (ms) p50: 689.000
2024/07/12 02:37:56 End to End Latency (ms) p90: 707.000
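Before comparing the two logs directly, it may help to check whether both runs sent the same workload. The sketch below divides the reported token throughput by the request throughput, using values copied from the logs; the helper names are hypothetical. Both runs generate about 4 tokens per request, but the Triton run reports roughly 738 prompt tokens per request while the vLLM run reports roughly 6. That suggests the prompts, or the way prompt tokens are counted, differed between the two runs. This is an observation from the logged numbers only.

```python
# (concurrency, req/s, prompt tokens/s, generation tokens/s), copied from the logs.
TRITON = [
    (1, 3.014, 2224.493, 12.057),
    (5, 3.088, 2278.757, 12.351),
    (10, 3.088, 2278.813, 12.351),
    (20, 3.088, 2278.796, 12.351),
    (30, 3.088, 2278.771, 12.351),
    (40, 3.088, 2278.700, 12.351),
    (50, 3.088, 2278.776, 12.351),
]
VLLM = [
    (1, 5.459, 32.751, 21.834),
    (5, 20.196, 121.179, 80.786),
    (10, 36.096, 216.573, 144.382),
    (20, 47.998, 287.985, 191.990),
    (30, 59.675, 358.051, 238.701),
    (40, 68.707, 412.240, 274.827),
    (50, 71.573, 429.441, 286.294),
]


def tokens_per_request(tokens_per_sec, requests_per_sec):
    """Average tokens handled per request, derived from two throughput figures."""
    return tokens_per_sec / requests_per_sec


def throughput_ratio(vllm_rps, triton_rps):
    """How many times more requests per second vLLM completed."""
    return vllm_rps / triton_rps


for (c, t_rps, t_prompt, t_gen), (_, v_rps, v_prompt, v_gen) in zip(TRITON, VLLM):
    print(
        f"concurrency={c:>2}  "
        f"prompt tok/req: triton={tokens_per_request(t_prompt, t_rps):6.1f} "
        f"vllm={tokens_per_request(v_prompt, v_rps):4.1f}  "
        f"gen tok/req: triton={tokens_per_request(t_gen, t_rps):3.1f} "
        f"vllm={tokens_per_request(v_gen, v_rps):3.1f}  "
        f"vllm/triton req/s={throughput_ratio(v_rps, t_rps):5.2f}"
    )
```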

Who can help?

is all in code above.

Expected behavior

better performance for concurrent requests and similar performance to vllm

actual behavior

performance degradation

additional notes


commented 1 month ago

@kaiyux Could you please have a look? Thanks

commented 1 month ago

Hi @Arnold1, how did you get the benchmark results for Triton inference and vLLM? Can you share your detailed steps so I can reproduce your results quickly and find the root cause of the gap?

commented 1 week ago

Hi @Arnold1, @sunnyqgg, were you able to figure out the root cause here? I am observing a similar trend with a llama2-7b model. I am using the latest versions of both TRT-LLM and vLLM, with their respective latest Triton servers.

commented 1 week ago

Hi @ashwin-js, this is not expected. Can you share your steps and commands for both?