NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

pipeline parallelism not showing expected LLM inference token generation throughput improvement #1819

Open · blanshard opened this issue 5 months ago

blanshard commented 5 months ago

Who can help?

@kaiyux

Reproduction

TensorRT-LLM version: 0.10.0

Using the benchmark script in /benchmarks/suite/tensorrt_llm_bench.

Example script: python3 benchmark.py --model meta-llama/Llama-2-7b-hf -tp 1 -pp 4 --workspace /workdir/Llama-2-7b-hf-tp1-pp4 --max-batch-size 16 inflight --request-rate 100 --dataset sample_dataset.json

The sample dataset was generated with the prepare_dataset.py script: 500 requests in total, each with 26 input tokens and a maximum of 1000 output tokens.
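
For reference, here is a minimal sketch of how the parallelism sweep described below (under Expected behavior) can be driven. It reuses the exact command above verbatim and only varies -tp/-pp; the workspace naming pattern and the configuration list are assumptions based on this report, not part of the benchmark suite.

```python
# Hypothetical sweep driver: reuses the benchmark command above and only varies -tp/-pp.
# Workspace names follow the pattern used in this report; adjust paths as needed.
import subprocess

CONFIGS = [(1, 1), (1, 2), (1, 4), (2, 1), (4, 1)]  # (tp, pp) pairs from the sweep

for tp, pp in CONFIGS:
    cmd = [
        "python3", "benchmark.py",
        "--model", "meta-llama/Llama-2-7b-hf",
        "-tp", str(tp),
        "-pp", str(pp),
        "--workspace", f"/workdir/Llama-2-7b-hf-tp{tp}-pp{pp}",
        "--max-batch-size", "16",
        "inflight",
        "--request-rate", "100",
        "--dataset", "sample_dataset.json",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```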

Expected behavior

I swept different parallelism configurations (single-GPU TP1PP1, TP1PP2, TP1PP4, TP2PP1, TP4PP1). The tensor-parallelism results (latency and system throughput) are as expected, but the pipeline-parallelism results are not.

I expect pipeline parallelism NOT to improve latency, but it should improve token generation throughput.
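
As a back-of-envelope illustration of that expectation (not TensorRT-LLM code; the 20 ms step time and 50 µs hop time are assumed numbers): splitting the layers across PP stages leaves the per-token critical path roughly unchanged, but once enough requests are in flight to keep every stage busy, the rate of decode steps is limited by the slowest stage, so aggregate throughput should scale roughly with PP.

```python
# Back-of-envelope model (illustrative assumptions, not measurements).

def decode_step_per_token_s(full_model_time_s: float, pp: int, hop_time_s: float) -> float:
    """Per-token decode latency for a single request: the token must still pass
    through every stage in order, so splitting the layers does not shorten it."""
    return full_model_time_s + (pp - 1) * hop_time_s


def steady_state_steps_per_s(full_model_time_s: float, pp: int, hop_time_s: float) -> float:
    """Rate of decode steps completed when enough requests are in flight to keep
    all stages busy: limited by the per-stage time, not the whole pipeline."""
    stage_time_s = full_model_time_s / pp + hop_time_s
    return 1.0 / stage_time_s


FULL_MODEL_S = 20e-3  # assumed time for one decode step through all layers on one GPU
HOP_S = 50e-6         # assumed inter-stage activation transfer time

for pp in (1, 2, 4):
    lat_ms = decode_step_per_token_s(FULL_MODEL_S, pp, HOP_S) * 1e3
    steps = steady_state_steps_per_s(FULL_MODEL_S, pp, HOP_S)
    print(f"PP={pp}: per-token latency ~{lat_ms:.1f} ms, ideal decode steps/s ~{steps:.0f}")
```

Under those assumptions, throughput should roughly double with PP2 and quadruple with PP4 while per-token latency stays essentially flat, which is the behavior being compared against below.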

Actual behavior

The measured throughput (tokens/second) barely improves at all with 2-GPU and 4-GPU pipeline parallelism.


Additional notes

Using Nsight Systems, I collected some profiling results; the problem seems related to either the pipeline scheduler or the NCCL kernels.

With TP4, the oneShotAllReduce NCCL kernel shows ~16 µs latency.

With PP4, the ncclDevKernel_SendRecv NCCL kernel is used instead and shows an average latency of 4.7 ms, which is abnormally high.
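
As a rough sanity check of what those kernel times would cost per decode step (only the two kernel latencies come from the profiles above; the hop count, the two all-reduces per layer, and the 32 layers of Llama-2-7B are assumptions):

```python
# Rough estimate of per-decode-step communication cost implied by the profiles above.
# Only the two kernel latencies come from the report; the rest are assumptions.

PP_STAGES = 4
SENDRECV_S = 4.7e-3                       # observed average SendRecv latency (PP4)
pp_comm = (PP_STAGES - 1) * SENDRECV_S    # assume 3 inter-stage hops per decode step

NUM_LAYERS = 32                           # Llama-2-7B transformer layers
ALLREDUCE_S = 16e-6                       # observed oneShotAllReduce latency (TP4)
ALLREDUCES_PER_LAYER = 2                  # assume one after attention, one after the MLP
tp_comm = NUM_LAYERS * ALLREDUCES_PER_LAYER * ALLREDUCE_S

print(f"PP4: ~{pp_comm * 1e3:.1f} ms of SendRecv per decode step")
print(f"TP4: ~{tp_comm * 1e3:.2f} ms of all-reduce per decode step")
```

If those assumptions hold, PP4 spends on the order of 14 ms per decode step in SendRecv alone versus roughly 1 ms of all-reduce for TP4, which would dominate a 7B model's step time and is consistent with the throughput barely improving.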

byshiue commented 4 months ago

There is a known performance issue with pipeline parallelism, and it should be fixed on the latest main branch. Could you give it a try?