System Info
TensorRT-LLM version: 0.10.0
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Using the benchmark script in /benchmarks/suite/tensorrt_llm_bench:
Example script: python3 benchmark.py --model meta-llama/Llama-2-7b-hf -tp 1 -pp 4 --workspace /workdir/Llama-2-7b-hf-tp1-pp4 --max-batch-size 16 inflight --request-rate 100 --dataset sample_dataset.json
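The other configurations in the sweep (described under Expected behavior below) were run with the same command, only varying -tp and -pp; the per-config workspace names here are my assumption:

# (tp pp) pairs from the sweep
for cfg in "1 1" "1 2" "1 4" "2 1" "4 1"; do
  set -- $cfg
  python3 benchmark.py --model meta-llama/Llama-2-7b-hf -tp $1 -pp $2 --workspace /workdir/Llama-2-7b-hf-tp$1-pp$2 --max-batch-size 16 inflight --request-rate 100 --dataset sample_dataset.json
done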
The sample dataset was generated with the prepare_dataset.py script: 500 requests in total, each with 26 input tokens and a maximum of 1000 output tokens.
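For reference, generating such a dataset with prepare_dataset.py would look roughly like the command below; the script path, subcommand, and flag names are from memory and may differ in your TensorRT-LLM checkout:

# 500 requests, 26 input tokens each, up to 1000 output tokens
python3 benchmarks/cpp/prepare_dataset.py --tokenizer meta-llama/Llama-2-7b-hf --output sample_dataset.json token-norm-dist --num-requests 500 --input-mean 26 --input-stdev 0 --output-mean 1000 --output-stdev 0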
Expected behavior
I swept different parallelism configurations (single GPU TP1PP1, TP1PP2, TP1PP4, TP2PP1, TP4PP1). The tensor parallelism performance (latency and system throughput) is as expected, but the pipeline parallelism performance is not.
I expect pipeline parallelism NOT to improve latency, but it should improve token-generation throughput.
actual behavior
The actual results show that throughput (tokens/second) barely improves at all with 2-GPU and 4-GPU pipeline parallelism.
additional notes
Using Nsight Systems, I collected some profiling results, and the problem seems to be related to either the pipeline scheduler or the NCCL kernels.
With TP4, the oneShotAllReduce NCCL kernel shows a latency of ~16 us.
With PP4, the ncclDevKernel_SendRecv NCCL kernel is used instead, and it shows an average latency of 4.7 ms (nearly 300x the TP4 all-reduce), which is abnormal.
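For completeness, a profile like the ones above can be collected by wrapping the benchmark command with Nsight Systems; the report name here is illustrative:

# trace CUDA kernels and NVTX ranges into a report named llama2-7b-pp4-profile
nsys profile -t cuda,nvtx -o llama2-7b-pp4-profile python3 benchmark.py --model meta-llama/Llama-2-7b-hf -tp 1 -pp 4 --workspace /workdir/Llama-2-7b-hf-tp1-pp4 --max-batch-size 16 inflight --request-rate 100 --dataset sample_dataset.json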