NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Difficulty reproducing NVIDIA numbers on 4x L40S system. #2140

Open jan-grzybek-ampere opened 3 weeks ago

jan-grzybek-ampere commented 3 weeks ago

System Info

Hi, I'm having trouble reproducing the numbers NVIDIA claims in the table here: https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html#throughput-measurements

The system I'm running on is an AWS g6e.12xlarge instance (4x L40S).

I closely followed the instructions in the link above.

docker image build:

make -C docker run LOCAL_USER=1 GPU_OPTS='--gpus "device=0,1,2,3"'
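(To rule out the obvious first, here is a quick sanity check that the container actually sees all four GPUs. This is just a sketch and assumes PyTorch is installed in the container image.)

import torch  # assumption: PyTorch is available inside the TensorRT-LLM container
print(torch.cuda.device_count())  # expect 4 on g6e.12xlarge
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))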

trt-llm build / install:

python3 ./scripts/build_wheel.py --benchmarks --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl

trt runner generation:

trtllm-build --model_config llama2_70b.json --use_fused_mlp enable --gpt_attention_plugin float16 --output_dir llama2_70b_4w --max_batch_size 4096 --max_input_len 2048 --reduce_fusion disable --workers 4 --max_num_tokens 8192 --use_paged_context_fmha enable --multiple_profiles enable

data generation:

python3 benchmarks/cpp/prepare_dataset.py --output=20482048data --tokenizer=meta-llama/Llama-2-70b-hf token-norm-dist --num-requests=1500 --input-mean=2048 --output-mean=2048 --input-stdev=0 --output-stdev=0

benchmark run:

mpirun -n 4 --allow-run-as-root --oversubscribe cpp/build/benchmarks/gptManagerBenchmark --engine_dir llama2_70b_4w/ --type IFB --dataset 20482048data --eos_id -1 --scheduler_policy guaranteed_no_evict --kv_cache_free_gpu_mem_fraction 0.99 --output_csv result.csv --request_rate -1.0 --enable_chunked_context --warm_up 0

stdout:

[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 3 is not available.
[BENCHMARK] num_samples 1500
[BENCHMARK] num_error_samples 0

[BENCHMARK] num_samples 1500
[BENCHMARK] total_latency(ms) 6847206.50
[BENCHMARK] seq_throughput(seq/sec) 0.22
[BENCHMARK] token_throughput(token/sec) 448.65

[BENCHMARK] avg_sequence_latency(ms) 3791355.50
[BENCHMARK] max_sequence_latency(ms) 6847191.50
[BENCHMARK] min_sequence_latency(ms) 599042.12
[BENCHMARK] p99_sequence_latency(ms) 6846964.50
[BENCHMARK] p90_sequence_latency(ms) 6649171.50
[BENCHMARK] p50_sequence_latency(ms) 3752817.75
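As a back-of-envelope check, the reported token_throughput is consistent with total generated tokens divided by total latency (assuming every request produced the full 2048 output tokens), so the measurement itself looks sane:

num_samples, output_len = 1500, 2048
total_latency_s = 6847206.50 / 1000.0
print(num_samples * output_len / total_latency_s)  # ~448.6 tokens/sec, matching the reported value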

Results are about 2x lower than claimed: 448 tok/s vs 891 tok/s (for tp=4, in=2048, out=2048). During the run I see all 4x L40S at ~100% utilization (nvidia-smi). Cooling is naturally top-notch at AWS. Any hints on how to achieve the claimed performance? Thanks.
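One more observation from the log above: the peer-access warnings suggest GPU-to-GPU P2P is not available on this instance, so the tensor-parallel all-reduce traffic presumably has to be staged through host memory over PCIe; I don't know how much of the gap that explains. A quick way to confirm what the runtime sees (just a sketch, assuming PyTorch is available inside the container; nvidia-smi topo -m gives a similar picture):

import torch  # assumption: PyTorch is installed in the container
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            # mirrors the per-pair peer-access check behind the warnings above
            print(f"peer access {i} -> {j}:", torch.cuda.can_device_access_peer(i, j))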

Who can help?

No response

Reproduction

  1. trtllm-build --model_config llama2_70b.json --use_fused_mlp enable --gpt_attention_plugin float16 --output_dir llama2_70b_4w --max_batch_size 4096 --max_input_len 2048 --reduce_fusion disable --workers 4 --max_num_tokens 8192 --use_paged_context_fmha enable --multiple_profiles enable
  2. python3 benchmarks/cpp/prepare_dataset.py --output=20482048data --tokenizer=meta-llama/Llama-2-70b-hf token-norm-dist --num-requests=1500 --input-mean=2048 --output-mean=2048 --input-stdev=0 --output-stdev=0
  3. mpirun -n 4 --allow-run-as-root --oversubscribe cpp/build/benchmarks/gptManagerBenchmark --engine_dir llama2_70b_4w/ --type IFB --dataset 20482048data --eos_id -1 --scheduler_policy guaranteed_no_evict --kv_cache_free_gpu_mem_fraction 0.99 --output_csv result.csv --request_rate -1.0 --enable_chunked_context --warm_up 0

Expected behavior

891 tps token throughput

actual behavior

448 tps token throughput

additional notes

No additional notes

MarcelWilnicki commented 2 weeks ago

System Info

Hi, I'm also having trouble reproducing the numbers NVIDIA claims in the table here: https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html#throughput-measurements

The system I'm running on is an Ampere One server (aarch64) with 2x H100 PCIe.

It has 160 OCPUs and two H100 PCIe GPUs with 80 GiB each. Driver Version: 550.90.07, CUDA Version: 12.4; the base OS is CentOS Stream 9.

docker image build:

make -C docker build
make -C docker run LOCAL_USER=1

trt-llm build / install:

python3 ./scripts/build_wheel.py --benchmarks --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl

trt runner generation:

trtllm-build --model_config /code/tensorrt_llm/examples/llama/model_cfg.json --use_fused_mlp --gpt_attention_plugin float16 --output_dir /code/tensorrt_llm/examples/llama/engine --max_batch_size 4096 --max_input_len 2048 --reduce_fusion disable --workers 1 --max_num_tokens 8192 --use_paged_context_fmha enable --multiple_profiles enable

data generation:

python benchmarks/cpp/prepare_dataset.py --output=/code/tensorrt_llm/examples/llama/dataset --tokenizer=meta-llama/Llama-2-70b-hf token-norm-dist --num-requests=2000 --input-mean=2048 --output-mean=2048 --input-stdev=0 --output-stdev=0

benchmark run:

mpirun -n 1 --allow-run-as-root --oversubscribe cpp/build/benchmarks/gptManagerBenchmark --engine_dir /code/tensorrt_llm/examples/llama/engine --type IFB --dataset /code/tensorrt_llm/examples/llama/dataset --eos_id -1 --scheduler_policy guaranteed_no_evict --kv_cache_free_gpu_mem_fraction 0.99 --output_csv result.csv --request_rate -1.0 --enable_chunked_context --warm_up 0

stdout:

[BENCHMARK] num_samples 2000
[BENCHMARK] num_error_samples 0

[BENCHMARK] num_samples 2000
[BENCHMARK] total_latency(ms) 5157260.50
[BENCHMARK] seq_throughput(seq/sec) 0.39
[BENCHMARK] token_throughput(token/sec) 794.22

[BENCHMARK] avg_sequence_latency(ms) 2744871.25
[BENCHMARK] max_sequence_latency(ms) 5157236.50
[BENCHMARK] min_sequence_latency(ms) 294931.25
[BENCHMARK] p99_sequence_latency(ms) 5156991.50
[BENCHMARK] p90_sequence_latency(ms) 4785110.00
[BENCHMARK] p50_sequence_latency(ms) 2722596.50

Results are roughly 3x lower than claimed: 794.22 tok/s vs 2457.73 tok/s (for tp=2, in=2048, out=2048). During the run I see both H100 PCIe GPUs at ~100% utilization (nvidia-smi). Any hints on how to achieve the claimed performance? Thanks.
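The same back-of-envelope check lines up here as well, so the gap does not look like a measurement artifact (a sketch, assuming every request generated the full 2048 output tokens):

num_samples, output_len = 2000, 2048
total_latency_s = 5157260.50 / 1000.0
print(num_samples * output_len / total_latency_s)  # ~794 tokens/sec, matching the reported value
print(num_samples * output_len / 2457.73)          # ~1667 s expected at the claimed rate vs ~5157 s measured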