Open jan-grzybek-ampere opened 3 weeks ago
Hi, I'm also having trouble reproducing NVIDIA's claimed numbers in the table here: https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html#throughput-measurements
The system I'm running on is an Ampere One machine with 2x H100 PCIe.
It has 160 OCPUs (aarch64) and 2x H100 PCIe on board, 80 GiB each. Driver Version: 550.90.07, CUDA Version: 12.4. The base OS is CentOS Stream 9.
make -C docker build
make -C docker run LOCAL_USER=1
python3 ./scripts/build_wheel.py --benchmarks --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl
trtllm-build --model_config /code/tensorrt_llm/examples/llama/model_cfg.json --use_fused_mlp --gpt_attention_plugin float16 --output_dir /code/tensorrt_llm/examples/llama/engine --max_batch_size 4096 --max_input_len 2048 --reduce_fusion disable --workers 1 --max_num_tokens 8192 --use_paged_context_fmha enable --multiple_profiles enable
python benchmarks/cpp/prepare_dataset.py --output=/code/tensorrt_llm/examples/llama/dataset --tokenizer=meta-llama/Llama-2-70b-hf token-norm-dist --num-requests=2000 --input-mean=2048 --output-mean=2048 --input-stdev=0 --output-stdev=0
mpirun -n 1 --allow-run-as-root --oversubscribe cpp/build/benchmarks/gptManagerBenchmark --engine_dir /code/tensorrt_llm/examples/llama/engine --type IFB --dataset /code/tensorrt_llm/examples/llama/dataset --eos_id -1 --scheduler_policy guaranteed_no_evict --kv_cache_free_gpu_mem_fraction 0.99 --output_csv result.csv --request_rate -1.0 --enable_chunked_context --warm_up 0
[BENCHMARK] num_samples 2000
[BENCHMARK] num_error_samples 0
[BENCHMARK] num_samples 2000
[BENCHMARK] total_latency(ms) 5157260.50
[BENCHMARK] seq_throughput(seq/sec) 0.39
[BENCHMARK] token_throughput(token/sec) 794.22
[BENCHMARK] avg_sequence_latency(ms) 2744871.25
[BENCHMARK] max_sequence_latency(ms) 5157236.50
[BENCHMARK] min_sequence_latency(ms) 294931.25
[BENCHMARK] p99_sequence_latency(ms) 5156991.50
[BENCHMARK] p90_sequence_latency(ms) 4785110.00
[BENCHMARK] p50_sequence_latency(ms) 2722596.50
Results are roughly 3x lower than claimed: 794.22 tps vs 2457.73 tps (for tp=2, in=2048, out=2048). During the run I see both H100 PCIe cards at ~100% utilization (nvidia-smi). Any hints on how to achieve the claimed performance? Thanks.
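As a sanity check on the [BENCHMARK] figures above, the reported throughputs can be recomputed from `num_samples` and `total_latency(ms)`. This is only a sketch: it assumes the token throughput counts generated tokens only and that every request produced exactly 2048 output tokens (as `--output-mean=2048 --output-stdev=0` suggests). The helper name `throughput` is made up for illustration and is not part of TensorRT-LLM.

```python
# Values copied from the [BENCHMARK] output above.
num_samples = 2000
total_latency_ms = 5157260.50
# Assumption: each request generates exactly 2048 tokens
# (--output-mean=2048, --output-stdev=0, --eos_id -1).
output_tokens_per_request = 2048


def throughput(samples, latency_ms, tokens_per_request):
    """Return (sequences/sec, output tokens/sec) over the whole run."""
    seconds = latency_ms / 1000.0
    seq_per_sec = samples / seconds
    tok_per_sec = samples * tokens_per_request / seconds
    return seq_per_sec, tok_per_sec


seq_per_sec, tok_per_sec = throughput(
    num_samples, total_latency_ms, output_tokens_per_request
)
print(f"{seq_per_sec:.2f} seq/s, {tok_per_sec:.2f} tok/s")
```

Under those assumptions the recomputed values agree with the logged 0.39 seq/sec and 794.22 token/sec, which suggests the log is internally consistent and the reported throughput refers to output tokens.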
System Info
Hi, I'm having trouble reproducing NVIDIA's claimed numbers in the table here: https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html#throughput-measurements
The system I'm running on is an AWS g6e.12xlarge instance (4x L40S).
I closely followed instructions in the link above.
docker image build:
trt-llm build / install:
trt runner generation:
data generation:
benchmark run:
stdout:
Results are 2x lower than claimed: 448 tps vs 891 tps (for tp=4, in=2048, out=2048). During the run I see all 4x L40S at ~100% utilization (nvidia-smi). Cooling is naturally top-notch at AWS. Any hints on how to achieve the claimed performance? Thanks.
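To put the two reports in this thread side by side, the gap between measured and claimed throughput can be worked out as below. This is a rough sketch that uses only the tps figures quoted in the thread; the helper name `shortfall` and the run labels are made up for illustration.

```python
def shortfall(measured_tps, claimed_tps):
    """Return (claimed / measured, measured as a percentage of claimed)."""
    ratio = claimed_tps / measured_tps
    percent = 100.0 * measured_tps / claimed_tps
    return ratio, percent


# (measured tps, claimed tps), as quoted in this thread.
runs = {
    "4x L40S, tp=4, in=2048, out=2048": (448.0, 891.0),
    "2x H100 PCIe, tp=2, in=2048, out=2048": (794.22, 2457.73),
}

for name, (measured, claimed) in runs.items():
    ratio, percent = shortfall(measured, claimed)
    print(f"{name}: {ratio:.2f}x below claimed ({percent:.1f}% of claimed)")
```

With these inputs the L40S run comes out at about half of the claimed figure, and the H100 PCIe run at about a third.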
Who can help?
No response
Reproduction
Expected behavior
891 tps token throughput
actual behavior
448 tps token throughput
additional notes
No additional notes