TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Reproduction
python ../run.py --max_output_len=64 \
--tokenizer_dir ../../model_weights/komt-mistral-7b-v1 \
--engine_dir=/tmp/mistral_komt_lora/7B/trt_engines/fp8/1-gpu/ \
--input_text "[INST]오늘은 날씨가 아주 좋다 내가 공원에 갔을 때 [/INST]" \
--max_attention_window_size=128 \
--lora_task_uids 0 \
--use_py_session \
--temperature 0.2 \
--num_beams 3
Expected behavior
Response text from model using beam search algorithm.
Actual behavior
[06/14/2024-12:27:38] [TRT-LLM] [W] Found pynvml==11.5.0 and cuda driver version 525.147.05. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100
[06/14/2024-12:27:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.11.2'}
[06/14/2024-12:27:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[06/14/2024-12:27:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[06/14/2024-12:27:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[06/14/2024-12:27:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[06/14/2024-12:27:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[06/14/2024-12:27:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
[06/14/2024-12:27:42] [TRT-LLM] [W] Implicitly setting LLaMAConfig.mup_width_multiplier = 1.0
[06/14/2024-12:27:42] [TRT-LLM] [I] Set dtype to float16.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set gemm_plugin to auto.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set identity_plugin to None.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set nccl_plugin to None.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set lookup_plugin to None.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set lora_plugin to auto.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set moe_plugin to auto.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set context_fmha to False.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set paged_kv_cache to True.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set remove_input_padding to True.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set reduce_fusion to False.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set multi_block_mode to False.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set enable_xqa to True.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set multiple_profiles to False.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set paged_state to True.
[06/14/2024-12:27:42] [TRT-LLM] [I] Set streamingllm to False.
[06/14/2024-12:27:42] [TRT] [I] Loaded engine size: 7182 MiB
[06/14/2024-12:27:43] [TRT] [I] [MS] Running engine with multi stream info
[06/14/2024-12:27:43] [TRT] [I] [MS] Number of aux streams is 1
[06/14/2024-12:27:43] [TRT] [I] [MS] Number of total worker streams is 2
[06/14/2024-12:27:43] [TRT] [I] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[06/14/2024-12:27:43] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 7173 (MiB)
[06/14/2024-12:27:43] [TRT] [I] [MS] Running engine with multi stream info
[06/14/2024-12:27:43] [TRT] [I] [MS] Number of aux streams is 1
[06/14/2024-12:27:43] [TRT] [I] [MS] Number of total worker streams is 2
[06/14/2024-12:27:43] [TRT] [I] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[06/14/2024-12:27:43] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 7173 (MiB)
[06/14/2024-12:27:43] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.
[06/14/2024-12:27:44] [TRT-LLM] [I] Load engine takes: 4.506639003753662 sec
[06/14/2024-12:27:44] [TRT-LLM] [W] The value of max_attention_window_size should ideally not exceed max_seq_length. Therefore, it has been adjusted to match the value of max_seq_length.
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
[06/14/2024-12:27:45] [TRT] [E] 7: [shapeMachine.cpp::executeContinuation::905] Error Code 7: Internal Error (Dimensions with name batch_size_beam_width must be equal. Condition '==' violated: 1 != 3. Instruction: CHECK_EQUAL 1 3.)
Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/llama/../run.py", line 504, in <module>
    main(args)
  File "/code/tensorrt_llm/examples/llama/../run.py", line 344, in main
    outputs = runner.generate(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 796, in generate
    outputs = self.session.decode(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 947, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3463, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3073, in decode_regular
    should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, context_logits, generation_logits, encoder_input_lengths = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2732, in handle_per_step
    raise RuntimeError(f"Executing TRT engine failed step={step}!")
RuntimeError: Executing TRT engine failed step=1!
Additional notes
How do I configure the batch_size_beam_width parameter?
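One thing that may be worth ruling out is whether the engine was built for fewer beams than the 3 requested with --num_beams. The sketch below is only a possible check, not a confirmed diagnosis: it assumes the engine directory contains a config.json with a build_config.max_beam_width entry, and the helper names read_max_beam_width and beams_supported are hypothetical, not part of TensorRT-LLM.

```python
import json
from pathlib import Path


def read_max_beam_width(engine_dir):
    """Return build_config.max_beam_width from <engine_dir>/config.json.

    Returns None when the file or the key is missing. The key layout is an
    assumption about the engine config format, not something this report confirms.
    """
    config_path = Path(engine_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # Assumed layout: {"build_config": {"max_beam_width": <int>, ...}, ...}
    return config.get("build_config", {}).get("max_beam_width")


def beams_supported(engine_dir, num_beams):
    """Return True/False if the limit is known, or None if it cannot be read."""
    max_beam_width = read_max_beam_width(engine_dir)
    if max_beam_width is None:
        return None
    return num_beams <= max_beam_width


if __name__ == "__main__":
    # Engine path taken from the reproduction command above.
    engine_dir = "/tmp/mistral_komt_lora/7B/trt_engines/fp8/1-gpu/"
    print("max_beam_width:", read_max_beam_width(engine_dir))
    print("supports 3 beams:", beams_supported(engine_dir, 3))
```

If the reported value is smaller than 3, rebuilding the engine with a larger beam width (trtllm-build has a --max_beam_width option) would presumably be needed. If it is already 3 or more, the mismatch more likely comes from a runtime input that is not expanded per beam, which this check does not cover.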
Who can help?
@ncomly-nvidia @byshiue