NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Difficulty reproducing NVIDIA numbers on 4x L40S system. #2140

Open jan-grzybek-ampere opened 3 weeks ago

jan-grzybek-ampere commented 3 weeks ago

System Info

Hi, I'm having trouble reproducing the numbers NVIDIA claims in the table here: https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html#throughput-measurements

The system I'm running on is an AWS g6e.12xlarge instance (4x L40S).

I closely followed the instructions in the link above.

docker image build:

make -C docker run LOCAL_USER=1 GPU_OPTS='--gpus "device=0,1,2,3"'
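(To rule out the obvious first, here is a quick sanity check that the container actually sees all four GPUs. This is just a sketch and assumes PyTorch is installed in the container image.)

import torch  # assumption: PyTorch is available inside the TensorRT-LLM container
print(torch.cuda.device_count())  # expect 4 on g6e.12xlarge
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))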

trt-llm build / install:

python3 ./scripts/build_wheel.py --benchmarks --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl

trt runner generation:

trtllm-build --model_config llama2_70b.json --use_fused_mlp enable --gpt_attention_plugin float16 --output_dir llama2_70b_4w --max_batch_size 4096 --max_input_len 2048 --reduce_fusion disable --workers 4 --max_num_tokens 8192 --use_paged_context_fmha enable --multiple_profiles enable

data generation:

python3 benchmarks/cpp/prepare_dataset.py --output=20482048data --tokenizer=meta-llama/Llama-2-70b-hf token-norm-dist --num-requests=1500 --input-mean=2048 --output-mean=2048 --input-stdev=0 --output-stdev=0

benchmark run:

mpirun -n 4 --allow-run-as-root --oversubscribe cpp/build/benchmarks/gptManagerBenchmark --engine_dir llama2_70b_4w/ --type IFB --dataset 20482048data --eos_id -1 --scheduler_policy guaranteed_no_evict --kv_cache_free_gpu_mem_fraction 0.99 --output_csv result.csv --request_rate -1.0 --enable_chunked_context --warm_up 0

stdout:

[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 3 is not available.
[BENCHMARK] num_samples 1500
[BENCHMARK] num_error_samples 0

[BENCHMARK] num_samples 1500
[BENCHMARK] total_latency(ms) 6847206.50
[BENCHMARK] seq_throughput(seq/sec) 0.22
[BENCHMARK] token_throughput(token/sec) 448.65

[BENCHMARK] avg_sequence_latency(ms) 3791355.50
[BENCHMARK] max_sequence_latency(ms) 6847191.50
[BENCHMARK] min_sequence_latency(ms) 599042.12
[BENCHMARK] p99_sequence_latency(ms) 6846964.50
[BENCHMARK] p90_sequence_latency(ms) 6649171.50
[BENCHMARK] p50_sequence_latency(ms) 3752817.75
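As a back-of-envelope check, the reported token_throughput is consistent with total generated tokens divided by total latency (assuming every request produced the full 2048 output tokens), so the measurement itself looks sane:

num_samples, output_len = 1500, 2048
total_latency_s = 6847206.50 / 1000.0
print(num_samples * output_len / total_latency_s)  # ~448.6 tokens/sec, matching the reported value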

Results are about 2x lower than claimed: 448 tok/s vs 891 tok/s (for tp=4, in=2048, out=2048). During the run I see all 4x L40S at ~100% utilization (nvidia-smi). Cooling is naturally top-notch at AWS. Any hints on how to achieve the claimed performance? Thanks.
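One more observation from the log above: the peer-access warnings suggest GPU-to-GPU P2P is not available on this instance, so the tensor-parallel all-reduce traffic presumably has to be staged through host memory over PCIe; I don't know how much of the gap that explains. A quick way to confirm what the runtime sees (just a sketch, assuming PyTorch is available inside the container; nvidia-smi topo -m gives a similar picture):

import torch  # assumption: PyTorch is installed in the container
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            # mirrors the per-pair peer-access check behind the warnings above
            print(f"peer access {i} -> {j}:", torch.cuda.can_device_access_peer(i, j))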

Who can help?

No response

Reproduction

  1. trtllm-build --model_config llama2_70b.json --use_fused_mlp enable --gpt_attention_plugin float16 --output_dir llama2_70b_4w --max_batch_size 4096 --max_input_len 2048 --reduce_fusion disable --workers 4 --max_num_tokens 8192 --use_paged_context_fmha enable --multiple_profiles enable
  2. python3 benchmarks/cpp/prepare_dataset.py --output=20482048data --tokenizer=meta-llama/Llama-2-70b-hf token-norm-dist --num-requests=1500 --input-mean=2048 --output-mean=2048 --input-stdev=0 --output-stdev=0
  3. mpirun -n 4 --allow-run-as-root --oversubscribe cpp/build/benchmarks/gptManagerBenchmark --engine_dir llama2_70b_4w/ --type IFB --dataset 20482048data --eos_id -1 --scheduler_policy guaranteed_no_evict --kv_cache_free_gpu_mem_fraction 0.99 --output_csv result.csv --request_rate -1.0 --enable_chunked_context --warm_up 0

Expected behavior

891 tps token throughput

actual behavior

448 tps token throughput

additional notes

No additional notes

MarcelWilnicki commented 2 weeks ago

System Info

Hi, I'm also having trouble reproducing the numbers NVIDIA claims in the table here: https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html#throughput-measurements

The system I'm running on is an Ampere One server (aarch64) with 2x H100 PCIe.

It has 160 OCPUs and two H100 PCIe GPUs with 80 GiB each. Driver Version: 550.90.07, CUDA Version: 12.4; the base OS is CentOS Stream 9.

docker image build:

make -C docker build
make -C docker run LOCAL_USER=1

trt-llm build / install:

python3 ./scripts/build_wheel.py --benchmarks --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl

trt runner generation:

trtllm-build --model_config /code/tensorrt_llm/examples/llama/model_cfg.json --use_fused_mlp --gpt_attention_plugin float16 --output_dir /code/tensorrt_llm/examples/llama/engine --max_batch_size 4096 --max_input_len 2048 --reduce_fusion disable --workers 1 --max_num_tokens 8192 --use_paged_context_fmha enable --multiple_profiles enable

data generation:

python benchmarks/cpp/prepare_dataset.py --output=/code/tensorrt_llm/examples/llama/dataset --tokenizer=meta-llama/Llama-2-70b-hf token-norm-dist --num-requests=2000 --input-mean=2048 --output-mean=2048 --input-stdev=0 --output-stdev=0

benchmark run:

mpirun -n 1 --allow-run-as-root --oversubscribe cpp/build/benchmarks/gptManagerBenchmark --engine_dir /code/tensorrt_llm/examples/llama/engine --type IFB --dataset /code/tensorrt_llm/examples/llama/dataset --eos_id -1 --scheduler_policy guaranteed_no_evict --kv_cache_free_gpu_mem_fraction 0.99 --output_csv result.csv --request_rate -1.0 --enable_chunked_context --warm_up 0

stdout:

[BENCHMARK] num_samples 2000
[BENCHMARK] num_error_samples 0

[BENCHMARK] num_samples 2000
[BENCHMARK] total_latency(ms) 5157260.50
[BENCHMARK] seq_throughput(seq/sec) 0.39
[BENCHMARK] token_throughput(token/sec) 794.22

[BENCHMARK] avg_sequence_latency(ms) 2744871.25
[BENCHMARK] max_sequence_latency(ms) 5157236.50
[BENCHMARK] min_sequence_latency(ms) 294931.25
[BENCHMARK] p99_sequence_latency(ms) 5156991.50
[BENCHMARK] p90_sequence_latency(ms) 4785110.00
[BENCHMARK] p50_sequence_latency(ms) 2722596.50

Results are roughly 3x lower than claimed: 794.22 tok/s vs 2457.73 tok/s (for tp=2, in=2048, out=2048). During the run I see both H100 PCIe GPUs at ~100% utilization (nvidia-smi). Any hints on how to achieve the claimed performance? Thanks.
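The same back-of-envelope check lines up here as well, so the gap does not look like a measurement artifact (a sketch, assuming every request generated the full 2048 output tokens):

num_samples, output_len = 2000, 2048
total_latency_s = 5157260.50 / 1000.0
print(num_samples * output_len / total_latency_s)  # ~794 tokens/sec, matching the reported value
print(num_samples * output_len / 2457.73)          # ~1667 s expected at the claimed rate vs ~5157 s measured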