taozhang9527 opened this issue 10 months ago (status: Open)
try tritonbackend?
Thanks. Can you be more specific about the tritonbackend instructions you tried?
I think that's what I did. Basically, I pulled the tritonserver image with the following command:
docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
Then I followed the instructions here.
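For reference, a minimal sketch of what that flow typically looks like with the tensorrtllm_backend repo (the model-repository layout and paths here are illustrative placeholders, not the exact commands from this thread):
# Get the backend repo that provides the Triton model templates.
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
# Copy the built TensorRT-LLM engine into the tensorrt_llm model's
# version directory inside a Triton model repository.
cp examples/llama/out/7b/fp16_1gpu_ifb/* triton_model_repo/tensorrt_llm/1/
# Launch Triton inside the 23.10-trtllm container against that repository.
tritonserver --model-repository=/path/to/triton_model_repo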
Can you share your scripts to run the triton backend and the results you observe? What you shared above is from gptSessionBenchmark.
same issue, stay tuned
same issue
Can you share your scripts to run the triton backend and the results you observe?
Do you mean that the benchmark tool of TensorRT-LLM is not utilizing in-flight batching?
@byshiue actually, for llama_7b benchmarked with the Python runtime, I did not see any performance boost from enabling FMHA or IFB or KV cache or quantization. It is weird. Do you have any insights?
To get better performance, please try the C++ runtime or the triton backend when you enable IFB.
Please share your environment, the scripts for each test, and the performance numbers for each test.
We recommend using the C++ runtime to benchmark because the Python runtime is not optimized.
Thank you, I am trying the C++ runtime.
Trying to see how in-flight batching and paged attention help with throughput on the Llama-7b model.
Scenario 1
python3 examples/llama/build.py --model_dir Llama-2-7b-chat-hf --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --max_output_len 2048 --output_dir examples/llama/out/7b/fp16_1gpu
./benchmarks/gptManagerBenchmark --model llama --engine_dir ../../examples/llama/out/7b/fp16_1gpu/ --type V1 --dataset ../../benchmarks/cpp/preprocessed_dataset.json
Results: 94.09 tokens/s
Scenario 2
python3 examples/llama/build.py --model_dir Llama-2-7b-chat-hf --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --remove_input_padding --use_inflight_batching --max_output_len 2048 --output_dir examples/llama/out/7b/fp16_1gpu_ifb
./benchmarks/gptManagerBenchmark --model llama --engine_dir ../../examples/llama/out/7b/fp16_1gpu_ifb/ --type IFB --dataset ../../benchmarks/cpp/preprocessed_dataset.json
Results: 91.87 tokens/s
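(As a sanity check — a sketch, assuming build.py writes the usual config.json with a plugin_config section into the engine directory — you can confirm the second engine really has the IFB-related options enabled:)
# Inspect the generated engine config; field names assumed from the
# plugin_config section written alongside the engine.
grep -o '"paged_kv_cache": [a-z]*' examples/llama/out/7b/fp16_1gpu_ifb/config.json
grep -o '"remove_input_padding": [a-z]*' examples/llama/out/7b/fp16_1gpu_ifb/config.json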
Scenario 3
./cpp/build/benchmarks/gptSessionBenchmark --model llama --engine_dir ./examples/llama/out/7b/fp16_1gpu/ --batch_size "1" --input_output_len "512, 200"
Results: 51 tokens/s
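(For scale, a back-of-envelope sketch of what this metric means — assuming gptSessionBenchmark at batch_size 1 reports output tokens divided by end-to-end request latency; the 3.9 s figure is hypothetical, only to show the arithmetic:)
# Hypothetical: if one request producing 200 tokens takes ~3.9 s,
# single-stream throughput is about 200 / 3.9 ≈ 51 tokens/s.
echo "scale=1; 200 / 3.9" | bc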
Docker image: 23.10-trtllm-python-py3
Driver version: 535.129.03
CUDA version: 12.2
GPU: L40
Questions:
(1) Comparing scenarios 1 and 2, why did in-flight batching and paged attention make no difference?
(2) Comparing scenarios 1 and 3, why do gptManagerBenchmark and gptSessionBenchmark give such different numbers?
(3) Is the performance related to the kind of input data used? For scenarios 1 and 2 I am using the data mentioned in issue #294; for scenario 3 I am using whatever the benchmark script picks.
(4) For scenarios 1 and 2, did I miss the step of converting the HF Transformers checkpoint to FT format described in tensorrtllm-backend? If that's the case, I don't understand it, because build.py does not use the converted folder in the example. E.g.,
Convert weights from HF Transformers to FT format
python3 hf_gpt_convert.py -p 8 -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 4 --storage-type float16
Build TensorRT engines