NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Performance decay when using paged attention #75

Closed sleepwalker2017 closed 9 months ago

sleepwalker2017 commented 11 months ago

Here is my benchmark result on A30 using llama-7b

[benchmark results image: llama-7b on A30, paged attention vs. normal vs. FT]

It seems the performance with paged attention enabled is much worse than without it, and also slightly lower than FasterTransformer (FT).

Is that normal?

jdemouth-nvidia commented 11 months ago

Thanks a lot for sharing your performance numbers @sleepwalker2017.

That is not expected and does not match our own results. Could you share the exact commands you used to run these tests, please?

We'd like to see if we can reproduce the issue and, if so, fix it quickly.

QiJune commented 11 months ago

Which runtime are you using for benchmarking, the Python runtime or the C++ runtime?

First, the Python runtime does not currently support paged attention efficiently, so I suggest using the C++ runtime.

Second, what is your test data? If all the prompts have the same input/output sequence length, paged attention will not bring any benefit in that scenario.
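
For illustration, a batch with varying lengths could be generated like this. This is only a sketch: the length range, padding id, and file name are assumptions, not part of the original benchmark.

import numpy as np

# Sketch: build 8 rows of random llama-vocab token ids with different lengths,
# padded to the longest row, so the KV cache can actually benefit from paging.
rng = np.random.default_rng(0)
lengths = rng.integers(16, 129, size=8)           # assumed per-request input lengths
pad_id = 0                                        # assumed padding id
max_len = int(lengths.max())
batch = np.full((8, max_len), pad_id, dtype=np.int32)
for i, n in enumerate(lengths):
    batch[i, :n] = rng.integers(3, 32000, size=n) # random ids; fine for throughput tests
np.savetxt('varied_len.csv', batch, fmt='%d', delimiter=',')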

sleepwalker2017 commented 11 months ago

> That is not expected and does not match our own results. Could you share the exact commands you used to run these tests, please?
>
> We'd like to see if we can reproduce the issue and, if so, fix it quickly.

Here are the build and run commands:

python build.py --model_dir /data/models/llama-7b-hf/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --paged_kv_cache \
                --output_dir ./tmp-paged-attention/llama/7B/trt_engines/fp16/1-gpu/
python3 run.py --max_output_len=96 \
               --tokenizer_dir /data/models/llama-7b-hf \
               --engine_dir=./tmp-paged-attention/llama/7B/trt_engines/fp16/1-gpu/

input ids:

1,24062,1747,278,11996,732,29899,29992,22053,457,10811,29176,584,3014,310,4908,15411,1919,478,29711,1919,2627,29871,29906,29896,1919,29871,29906,29900,29900,29929,869
1,24062,1747,278,11996,732,29899,29992,22053,457,10811,29176,584,3014,310,4908,15411,1919,478,29711,1919,2627,29871,29906,29896,1919,29871,29906,29900,29900,29929,869
1,24062,1747,278,11996,732,29899,29992,22053,457,10811,29176,584,3014,310,4908,15411,1919,478,29711,1919,2627,29871,29906,29896,1919,29871,29906,29900,29900,29929,869
1,24062,1747,278,11996,732,29899,29992,22053,457,10811,29176,584,3014,310,4908,15411,1919,478,29711,1919,2627,29871,29906,29896,1919,29871,29906,29900,29900,29929,869
1,24062,1747,278,11996,732,29899,29992,22053,457,10811,29176,584,3014,310,4908,15411,1919,478,29711,1919,2627,29871,29906,29896,1919,29871,29906,29900,29900,29929,869
1,24062,1747,278,11996,732,29899,29992,22053,457,10811,29176,584,3014,310,4908,15411,1919,478,29711,1919,2627,29871,29906,29896,1919,29871,29906,29900,29900,29929,869
1,24062,1747,278,11996,732,29899,29992,22053,457,10811,29176,584,3014,310,4908,15411,1919,478,29711,1919,2627,29871,29906,29896,1919,29871,29906,29900,29900,29929,869
1,24062,1747,278,11996,732,29899,29992,22053,457,10811,29176,584,3014,310,4908,15411,1919,478,29711,1919,2627,29871,29906,29896,1919,29871,29906,29900,29900,29929,869
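
For reference, every row above is the same 32-token prompt. A quick sketch of how this batch can be written out, assuming it is the content of the len_32.csv file loaded in the diff in the next comment:

import numpy as np

# The single 32-token row shown above, repeated 8 times and saved as CSV.
row = [1,24062,1747,278,11996,732,29899,29992,22053,457,10811,29176,584,3014,
       310,4908,15411,1919,478,29711,1919,2627,29871,29906,29896,1919,29871,
       29906,29900,29900,29929,869]
batch = np.tile(np.asarray(row, dtype=np.int32), (8, 1))
np.savetxt('len_32.csv', batch, fmt='%d', delimiter=',')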
sleepwalker2017 commented 11 months ago

Code modification (diff against examples/llama/run.py):

diff --git a/examples/llama/run.py b/examples/llama/run.py
index d42a8a2..cfb6550 100644
--- a/examples/llama/run.py
+++ b/examples/llama/run.py
@@ -237,20 +237,52 @@ def generate(
                                                      debug_tensors_to_save=None)
     if runtime_rank == 0:
         print(f"Running the {dtype} engine ...")
-
+    '''
     input_ids, input_lengths = parse_input(input_text, input_file, tokenizer,
                                            EOS_TOKEN,
                                            model_config.remove_input_padding)
+    '''
+    t = np.loadtxt('len_32.csv', delimiter=',', dtype=np.int32)
+    input_ids = torch.Tensor(t).int().cuda()

+    if len(input_ids.shape) == 1:
+        input_ids = input_ids.reshape(1, -1)
+    input_lengths = torch.Tensor(input_ids.shape[0] * [input_ids.shape[-1]]).int().cuda()
     max_input_length = torch.max(input_lengths).item()
+
+    if False:
+        input_ids = input_ids.reshape(1, -1)
+    print('decoder.setup', input_lengths.size(0), max_input_length, max_output_len, num_beams)
+    #import pdb; pdb.set_trace()
     decoder.setup(input_lengths.size(0), max_input_length, max_output_len,
                   num_beams)
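
Outside diff form, the replaced input-handling logic amounts to roughly the following (a sketch that follows the variable names used in the diff):

import numpy as np
import torch

# Load a fixed batch of token ids from CSV instead of tokenizing input_text.
t = np.loadtxt('len_32.csv', delimiter=',', dtype=np.int32)
input_ids = torch.tensor(t, dtype=torch.int32).cuda()
if input_ids.dim() == 1:                     # a single row becomes shape (1, seq_len)
    input_ids = input_ids.reshape(1, -1)
# Every row is unpadded and equally long, so each length is just the row width.
input_lengths = torch.full((input_ids.shape[0],), input_ids.shape[-1],
                           dtype=torch.int32).cuda()
max_input_length = int(input_lengths.max())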
sleepwalker2017 commented 10 months ago

> Which runtime are you using for benchmarking, the Python runtime or the C++ runtime?
>
> First, the Python runtime does not currently support paged attention efficiently, so I suggest using the C++ runtime.
>
> Second, what is your test data? If all the prompts have the same input/output sequence length, paged attention will not bring any benefit in that scenario.

I see. I'm using the Python runtime, so it's a known issue.

sleepwalker2017 commented 10 months ago

> Which runtime are you using for benchmarking, the Python runtime or the C++ runtime?
>
> First, the Python runtime does not currently support paged attention efficiently, so I suggest using the C++ runtime.
>
> Second, what is your test data? If all the prompts have the same input/output sequence length, paged attention will not bring any benefit in that scenario.

Hello, I wonder why TRT-LLM uses much more memory than FT. Is that normal?

ryxli commented 10 months ago

> Which runtime are you using for benchmarking, the Python runtime or the C++ runtime? First, the Python runtime does not currently support paged attention efficiently, so I suggest using the C++ runtime. Second, what is your test data? If all the prompts have the same input/output sequence length, paged attention will not bring any benefit in that scenario.
>
> Hello, I wonder why TRT-LLM uses much more memory than FT. Is that normal?

At least for Llama (GPTNeoX), FasterTransformer had fp32 accumulation in unfused MHA enabled by default, which isn't the case in TRT-LLM. Did you build the engine with --enable_context_fmha_fp32_acc? I would expect this setting to lower the memory consumption compared to FT.
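
For reference, a sketch of the earlier build command with that flag, assuming --enable_context_fmha_fp32_acc is used in place of --enable_context_fmha (the two flags select different context-FMHA variants); all other options are copied from the command above:

python build.py --model_dir /data/models/llama-7b-hf/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha_fp32_acc \
                --use_gemm_plugin float16 \
                --paged_kv_cache \
                --output_dir ./tmp-paged-attention/llama/7B/trt_engines/fp16/1-gpu/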

sleepwalker2017 commented 10 months ago

> --enable_context_fmha_fp32_acc

Thanks, I'll try that.

byshiue commented 9 months ago

Closing this bug because the issue has been inactive. Feel free to ask here if you still have a question or issue, and we will reopen it.