NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Do Not See Performance Boost When In-flight Batching and PagedAttention Enabled #370

Open taozhang9527 opened 10 months ago

taozhang9527 commented 10 months ago

Trying to see how in-flight batching and paged attention help with throughput for the Llama-7b model.

Scenario 1 python3 examples/llama/build.py --model_dir Llama-2-7b-chat-hf --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --max_output_len 2048 --output_dir examples/llama/out/7b/fp16_1gpu

./benchmarks/gptManagerBenchmark --model llama --engine_dir ../../examples/llama/out/7b/fp16_1gpu/ --type V1 --dataset ../../benchmarks/cpp/preprocessed_dataset.json

Results: 94.09 tokens/s

Scenario 2 python3 examples/llama/build.py --model_dir Llama-2-7b-chat-hf --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --remove_input_padding --use_inflight_batching --max_output_len 2048 --output_dir examples/llama/out/7b/fp16_1gpu_ifb

./benchmarks/gptManagerBenchmark --model llama --engine_dir ../../examples/llama/out/7b/fp16_1gpu_ifb/ --type IFB --dataset ../../benchmarks/cpp/preprocessed_dataset.json

Results: 91.87 tokens/s

Scenario 3 ./cpp/build/benchmarks/gptSessionBenchmark --model llama --engine_dir ./examples/llama/out/7b/fp16_1gpu/ --batch_size "1" --input_output_len "512, 200"

Results: 51 tokens/s
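For reference, scenario 3 runs at a fixed batch size of 1, while the gptManagerBenchmark runs are driven by a whole dataset. gptSessionBenchmark accepts a list of batch sizes, so a sweep along the lines below (flag syntax assumed from the cpp benchmark README, and only meaningful if the engine was built with a large enough --max_batch_size) would make the comparison more direct:

./cpp/build/benchmarks/gptSessionBenchmark --model llama \
                 --engine_dir ./examples/llama/out/7b/fp16_1gpu/ \
                 --batch_size "1;8;32" \
                 --input_output_len "512,200"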

Docker version: 23.10-trtllm-python-py3 Driver version: 535.129.03 Cuda version: 12.2 GPU: L40

Questions:
(1) Comparing scenarios 1 and 2, why did in-flight batching and paged attention make no difference?
(2) Comparing scenarios 1 and 3, why do gptManagerBenchmark and gptSessionBenchmark give such a large difference?
(3) Is the performance related to the kind of input data used? For scenarios 1 and 2 I am using the data mentioned in issue #294; for scenario 3 I am using whatever the benchmark script picks.
(4) For scenarios 1 and 2, did I miss the step of converting the HF Transformers checkpoint to FT format described in tensorrtllm_backend (see the sketch after the example below)? If so, I don't understand it, because build.py does not use the converted folder in the example. E.g.:

Convert weights from HF Transformers to FT format

python3 hf_gpt_convert.py -p 8 -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 4 --storage-type float16

Build TensorRT engines

python3 build.py --model_dir=./c-model/gpt2/4-gpu/ \
                 --world_size=4 \
                 --dtype float16 \
                 --use_inflight_batching \
                 --use_gpt_attention_plugin float16 \
                 --paged_kv_cache \
                 --use_gemm_plugin float16 \
                 --remove_input_padding \
                 --use_layernorm_plugin float16 \
                 --hidden_act gelu \
                 --parallel_build \
                 --output_dir=engines/fp16/4-gpu
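
As a side note on question (4): the Llama example's build.py reads the Hugging Face checkpoint directly through --model_dir, so the hf_gpt_convert.py step above only applies to the GPT example. For the in-flight batching comparison, a build roughly like the one below (flag names assumed from the same-era examples/llama/build.py, please verify against your version) makes the paged KV cache and batch size explicit:

python3 examples/llama/build.py --model_dir Llama-2-7b-chat-hf \
                 --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --remove_input_padding \
                 --use_inflight_batching \
                 --paged_kv_cache \
                 --max_batch_size 64 \
                 --max_output_len 2048 \
                 --output_dir examples/llama/out/7b/fp16_1gpu_ifb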
shiqingzhangCSU commented 10 months ago

Try the Triton backend?

taozhang9527 commented 10 months ago

Thanks, can you be more specific about the Triton backend instructions you tried?

I think that's what I did. Basically, I pulled the Triton server image with the following command: docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3. Then I followed the instructions here.
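
For context, the tensorrtllm_backend flow referenced above boils down to pointing the tensorrt_llm model's config.pbtxt in the inflight_batcher_llm model repository at the built engine directory and then starting the server; a rough sketch (script name and flags assumed from the tensorrtllm_backend repo, to be checked against the version matching the 23.10 image):

python3 scripts/launch_triton_server.py --world_size 1 \
                 --model_repo all_models/inflight_batcher_llm/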

byshiue commented 10 months ago

Can you share the scripts you use to run the Triton backend and the results you observe? What you shared above is from gptSessionBenchmark.

renwuli commented 9 months ago

same issue, stay tuned

Hap-Zhang commented 9 months ago

same issue

byshiue commented 9 months ago

Can you share the scripts you use to run the Triton backend and the results you observe?

tianliplus commented 9 months ago

Try the Triton backend?

Do you mean that the TensorRT-LLM benchmark tool is not utilizing in-flight batching?

renwuli commented 9 months ago

@byshiue Actually, for llama_7b benchmarked with the Python runtime, I did not see any performance boost from enabling FMHA, IFB, KV cache, or quantization. It is weird. Do you have any insights?
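
For context: in-flight batching is a property of the batch manager used by gptManagerBenchmark and the Triton backend, so the Python session-style benchmark, which runs static batches, would not be expected to show a gain from it. The other features map to build-time flags; roughly (flag names assumed from the same-era examples/llama/build.py, verify against your version):

python3 examples/llama/build.py --model_dir Llama-2-7b-chat-hf \
                 --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --enable_context_fmha \
                 --use_weight_only \
                 --weight_only_precision int8 \
                 --output_dir examples/llama/out/7b/fp16_1gpu_fmha_int8

Whether a given flag pays off is easiest to see by rebuilding with one change at a time and re-running the same C++ benchmark.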

byshiue commented 9 months ago

To get better performance, please try the C++ runtime or the Triton backend when you enable IFB.

Actually, for llama_7b benchmarked with the Python runtime, I did not see any performance boost from enabling FMHA, IFB, KV cache, or quantization. It is weird. Do you have any insights?

Please share your environment, the scripts for each test, and the performance numbers of each test.

byshiue commented 9 months ago

We recommend using the C++ runtime to benchmark because the Python runtime is not optimized.

renwuli commented 9 months ago

Thank you, I am trying the C++ runtime.