Closed sleepwalker2017 closed 9 months ago
Thanks a lot for sharing your performance numbers @sleepwalker2017.
That is not expected and does not match our own measurements. Could you share the exact commands you used to run the tests, please?
We'd like to see if we can reproduce the issue and, if so, fix it quickly.
Which runtime are you using when benchmarking, the Python runtime or the C++ runtime?
First, the Python runtime currently does not support paged attention efficiently, so I suggest using the C++ runtime.
Second, what is your test data? If all the prompts have the same input/output sequence length, paged attention will not bring any benefit in this scenario.
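To illustrate the second point: paged attention pays off because it allocates the KV cache in fixed-size pages instead of reserving worst-case contiguous space per request. A minimal sketch of the allocation arithmetic (illustrative only; the page size and allocation model here are simplified assumptions, not TRT-LLM internals):

```python
import math

def kv_slots(seq_lens, max_seq_len, page_size=None):
    """Count KV-cache token slots reserved for a batch of requests."""
    if page_size is None:
        # contiguous allocation: every request reserves max_seq_len slots
        return len(seq_lens) * max_seq_len
    # paged allocation: each request reserves only the whole pages it uses
    return sum(math.ceil(n / page_size) * page_size for n in seq_lens)

uniform = [128] * 8              # all prompts the same length
mixed = [16, 32, 64, 128] * 2    # varied lengths

# identical lengths: paged allocation reserves exactly as many slots,
# so paging brings no memory benefit (only bookkeeping overhead)
assert kv_slots(uniform, 128, page_size=16) == kv_slots(uniform, 128)

# varied lengths: paged allocation reserves far fewer slots
assert kv_slots(mixed, 128, page_size=16) < kv_slots(mixed, 128)
```

With a batch of identical-length prompts, as in the benchmark below, the paged path does the same amount of work plus page-table indirection, which is consistent with it looking slower.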
python build.py --model_dir /data/models/llama-7b-hf/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--paged_kv_cache \
--output_dir ./tmp-paged-attention/llama/7B/trt_engines/fp16/1-gpu/
python3 run.py --max_output_len=96 \
--tokenizer_dir /data/models/llama-7b-hf \
--engine_dir=./tmp-paged-attention/llama/7B/trt_engines/fp16/1-gpu/
input ids (the same 32-token prompt, repeated 8 times for a batch of 8):
1,24062,1747,278,11996,732,29899,29992,22053,457,10811,29176,584,3014,310,4908,15411,1919,478,29711,1919,2627,29871,29906,29896,1919,29871,29906,29900,29900,29929,869
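The patch below reads these ids from `len_32.csv`. For reference, a file in that shape can be produced like this (a hypothetical reconstruction from the ids above, not the original script):

```python
import numpy as np

# The 32 token ids of the prompt shown above
prompt = [1, 24062, 1747, 278, 11996, 732, 29899, 29992, 22053, 457,
          10811, 29176, 584, 3014, 310, 4908, 15411, 1919, 478, 29711,
          1919, 2627, 29871, 29906, 29896, 1919, 29871, 29906, 29900,
          29900, 29929, 869]

# Batch of 8 identical prompts, one CSV row per request
batch = np.tile(np.array(prompt, dtype=np.int32), (8, 1))
np.savetxt('len_32.csv', batch, delimiter=',', fmt='%d')

# Round-trip check: run.py reloads it the same way
t = np.loadtxt('len_32.csv', delimiter=',', dtype=np.int32)
assert t.shape == (8, 32)
```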
code modification
diff --git a/examples/llama/run.py b/examples/llama/run.py
index d42a8a2..cfb6550 100644
--- a/examples/llama/run.py
+++ b/examples/llama/run.py
@@ -237,20 +237,52 @@ def generate(
debug_tensors_to_save=None)
if runtime_rank == 0:
print(f"Running the {dtype} engine ...")
-
+ '''
input_ids, input_lengths = parse_input(input_text, input_file, tokenizer,
EOS_TOKEN,
model_config.remove_input_padding)
+ '''
+ t = np.loadtxt('len_32.csv', delimiter=',', dtype=np.int32)
+ input_ids = torch.Tensor(t).int().cuda()
+ if len(input_ids.shape) == 1:
+ input_ids = input_ids.reshape(1, -1)
+ input_lengths = torch.Tensor(input_ids.shape[0] * [input_ids.shape[-1]]).int().cuda()
max_input_length = torch.max(input_lengths).item()
+
+ if False:
+ input_ids = input_ids.reshape(1, -1)
+ print('decoder.setup', input_lengths.size(0), max_input_length, max_output_len, num_beams)
+ #import pdb; pdb.set_trace()
decoder.setup(input_lengths.size(0), max_input_length, max_output_len,
num_beams)
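The loading logic in the patch can be checked framework-free; here numpy stands in for the torch tensors, and `load_inputs` is a hypothetical helper name:

```python
import numpy as np

def load_inputs(path):
    """Mirror the patched run.py: load a token-id matrix from CSV and
    derive one length per row (all rows share the same width)."""
    ids = np.loadtxt(path, delimiter=',', dtype=np.int32)
    if ids.ndim == 1:                  # a single prompt becomes a 1-row batch
        ids = ids.reshape(1, -1)
    lengths = np.full(ids.shape[0], ids.shape[-1], dtype=np.int32)
    return ids, lengths

if __name__ == "__main__":
    import os, tempfile
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
        f.write("1,2,3,4\n1,2,3,4\n")
        path = f.name
    ids, lengths = load_inputs(path)
    assert ids.shape == (2, 4) and lengths.tolist() == [4, 4]
    os.unlink(path)
```

Note that because every row carries the same width, `input_lengths` is necessarily uniform, which is exactly the "all prompts the same length" case where paged attention cannot help.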
I see. I'm using the Python runtime, so it's a known issue.
Hello, I wonder why TRT-LLM uses much more memory than FT. Is that normal?
At least for Llama (GPTNeoX), FasterTransformer had fp32 accumulation in unfused MHA enabled by default, which isn't the case in TRT-LLM. Did you build the engine with --enable_context_fmha_fp32_acc? I would expect that setting to lower the memory consumption compared to FT.
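The fp32-accumulation point is easy to demonstrate outside any framework: summing many fp16 values with an fp16 accumulator eventually stalls once the increment drops below the accumulator's precision. A small numpy sketch of the numerical effect (not TRT-LLM or FT code):

```python
import numpy as np

addend = np.float16(0.1)      # ~0.0999756 once rounded to fp16
n = 4096
exact = float(addend) * n     # exact sum of the fp16-rounded addends

acc16 = np.float16(0.0)       # fp16 accumulator (unfused fp16 MHA analogue)
acc32 = np.float32(0.0)       # fp32 accumulator (fp32-accumulation analogue)
for _ in range(n):
    acc16 = np.float16(acc16 + addend)
    acc32 = acc32 + np.float32(addend)

# Near 256 the fp16 spacing (0.25) exceeds the 0.1 increment, so acc16
# stops growing; its error dwarfs the fp32 accumulator's error.
assert abs(float(acc16) - exact) > 100 * abs(float(acc32) - exact)
```

The same effect applied to attention-score sums is why fp32 accumulation changes results (and can change buffer sizes) between the two accumulation modes.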
Thanks, I'll try that.
Closing this bug because the issue has been inactive. Feel free to ask here if you still have a question/issue; we will reopen it.
Here is my benchmark result on A30 using llama-7b.
It seems the performance with paged attention is much worse than without it, and also a little lower than FT.
Is that normal?