System Info
TensorRT-LLM version: 0.10.0
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Using the benchmark script in /benchmarks/suite/tensorrt_llm_bench:
Example script: python3 benchmark.py --model meta-llama/Llama-2-7b-hf -tp 1 -pp 4 --workspace /workdir/Llama-2-7b-hf-tp1-pp4 --max-batch-size 16 inflight --request-rate 100 --dataset sample_dataset.json
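The other configurations in the sweep (described under Expected behavior below) were run with the same command, only varying -tp and -pp; the per-config workspace names here are my assumption:

# (tp pp) pairs from the sweep
for cfg in "1 1" "1 2" "1 4" "2 1" "4 1"; do
  set -- $cfg
  python3 benchmark.py --model meta-llama/Llama-2-7b-hf -tp $1 -pp $2 --workspace /workdir/Llama-2-7b-hf-tp$1-pp$2 --max-batch-size 16 inflight --request-rate 100 --dataset sample_dataset.json
done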
The sample dataset was generated with the prepare_dataset.py script: 500 requests in total, each with 26 input tokens and a maximum of 1000 output tokens.
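For reference, generating such a dataset with prepare_dataset.py would look roughly like the command below; the script path, subcommand, and flag names are from memory and may differ in your TensorRT-LLM checkout:

# 500 requests, 26 input tokens each, up to 1000 output tokens
python3 benchmarks/cpp/prepare_dataset.py --tokenizer meta-llama/Llama-2-7b-hf --output sample_dataset.json token-norm-dist --num-requests 500 --input-mean 26 --input-stdev 0 --output-mean 1000 --output-stdev 0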
Expected behavior
I swept different parallelism configurations (single GPU TP1PP1, TP1PP2, TP1PP4, TP2PP1, TP4PP1). The tensor parallelism performance (latency and system throughput) is as expected, but the pipeline parallelism performance is not.
I expect pipeline parallelism NOT to improve latency, but it should improve token-generation throughput.
actual behavior
The actual results show that throughput (tokens/second) barely improves at all with 2-GPU and 4-GPU pipeline parallelism.
additional notes
Using Nsight Systems, I collected some profiling results, and the problem seems to be related to either the pipeline scheduler or the NCCL kernels.
With TP4, the oneShotAllReduce NCCL kernel shows a latency of ~16 us.
With PP4, the ncclDevKernel_SendRecv NCCL kernel is used instead, and it shows an average latency of 4.7 ms (nearly 300x the TP4 all-reduce), which is abnormal.
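For completeness, a profile like the ones above can be collected by wrapping the benchmark command with Nsight Systems; the report name here is illustrative:

# trace CUDA kernels and NVTX ranges into a report named llama2-7b-pp4-profile
nsys profile -t cuda,nvtx -o llama2-7b-pp4-profile python3 benchmark.py --model meta-llama/Llama-2-7b-hf -tp 1 -pp 4 --workspace /workdir/Llama-2-7b-hf-tp1-pp4 --max-batch-size 16 inflight --request-rate 100 --dataset sample_dataset.json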