When building the engine with trtllm-build I set --max_input_len=4009 --max_output_len=4009, but the following error occurs when running /TensorRT-LLM/benchmarks/python/benchmark.py with --input_output_len "2048,2048":
Traceback (most recent call last):
File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 412, in <module>
main(args)
File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 371, in main
e.with_traceback())
TypeError: BaseException.with_traceback() takes exactly one argument (0 given)
[04/26/2024-04:57:33] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[04/26/2024-04:57:33] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
Traceback (most recent call last):
File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 346, in main
benchmarker.run(inputs, config)
File "/TensorRT-LLM/benchmarks/python/gpt_benchmark.py", line 220, in run
self.decoder.decode_batch(inputs[0],
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2803, in decode_batch
return self.decode(input_ids,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 789, in wrapper
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2993, in decode
return self.decode_regular(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2642, in decode_regular
should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, context_logits, generation_logits, encoder_input_lengths = self.handle_per_step(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2334, in handle_per_step
raise RuntimeError(f"Executing TRT engine failed step={step}!")
RuntimeError: Executing TRT engine failed step=0!
The pre-defined limit of 4009 is greater than 2048, so why does this error occur? I tried "2048,128" and it works, but "2048,2048" fails.
What is the relationship between max_input_len, max_output_len, and input_output_len?
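For what it's worth, my naive reading (an assumption on my part, not something confirmed by the docs) is that input and output lengths are bounded independently by the build-time limits. A minimal sketch of that check, using the numbers from this report and a hypothetical helper name:

```python
# Hypothetical sanity check of benchmark shapes against the build-time limits.
# Whether TensorRT-LLM actually enforces only these two independent bounds
# (rather than, e.g., a combined sequence-length bound) is an assumption.
def fits_engine(input_len: int, output_len: int,
                max_input_len: int, max_output_len: int) -> bool:
    """True if (input_len, output_len) should fit an engine built with the
    given limits, assuming input and output are bounded independently."""
    return input_len <= max_input_len and output_len <= max_output_len

# Both benchmark shapes from this report pass the naive check,
# which is why the "2048,2048" failure is surprising.
print(fits_engine(2048, 128, 4009, 4009))   # True
print(fits_engine(2048, 2048, 4009, 4009))  # True, yet the benchmark fails
```

If the runtime instead enforced something like input_len + output_len <= max_input_len, that would explain why 2048+128 passes while 2048+2048 fails, but that is speculation on my side.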
Expected behavior
The benchmark runs to completion without errors.
actual behavior
The benchmark fails at step 0 with the RuntimeError shown above.
additional notes
What is the relationship between max_input_len, max_output_len, and input_output_len?
System Info
NVIDIA H20 97871MiB * 8
trt-llm 0.9.0
Who can help?
No response
Reproduction
Model: Llama2-70b-chat-hf; 8 GPUs, 1 node
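The two invocations can be sketched as below (checkpoint/engine paths and any flags not quoted in this report are elided):

trtllm-build \
    --max_input_len=4009 \
    --max_output_len=4009 \
    ...

python /TensorRT-LLM/benchmarks/python/benchmark.py \
    --input_output_len "2048,2048" \
    ...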