NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[BUG] LLaVA 13B batch_size=2 inference fails #1382

Closed DefTruth closed 7 months ago

DefTruth commented 7 months ago

System Info

NVIDIA L20 CUDA 12.3 TensorRT-LLM 0.9.0.dev2024032600

Who can help?

@byshiue

Reproduction

After building the engine with max_batch_size=4, I run:

mpirun --allow-run-as-root -n 2 python3 run.py \
    --max_new_tokens 1 \
    --hf_model_dir $HF_MODELS/$MODEL_NAME \
    --visual_engine_dir visual_engines/$MODEL_NAME \
    --llm_engine_dir trt_engines/$MODEL_NAME/fp16/2-gpu \
    --batch_size 2 \
    --decoder_llm --run_profiling \
    --input_text "Question: 请问下如果去这个地方旅游,我应该注意什么?提示:图片中包含一个很大的湖泊,湖泊的中间有一个木板制作的木桥梁,桥梁呈现T字型,桥梁的 尽头连接这一个平台,平台上有围栏,但是没有全部围住。湖泊的两边都是树林,树木很高。湖泊的尽头是高山,雨雾缭绕。请详细描述一下。 Answer:"

LLaVA 13B inference with batch_size=2 fails with:

(both MPI ranks emit the same errors and traceback; deduplicated below)

[03/30/2024-19:32:51] [TRT] [E] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[03/30/2024-19:32:52] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
Traceback (most recent call last):
  File "/workspace/dev/openllm/TensorRT-LLM/examples/multimodal/run.py", line 464, in <module>
    model.generate(pre_prompt,
  File "/workspace/dev/openllm/TensorRT-LLM/examples/multimodal/run.py", line 218, in generate
    output_ids = self.model.generate(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 692, in generate
    outputs = self.session.decode(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 789, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2993, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2642, in decode_regular
    should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, context_logits, generation_logits, encoder_input_lengths = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2334, in handle_per_step
    raise RuntimeError(f"Executing TRT engine failed step={step}!")
RuntimeError: Executing TRT engine failed step=0!

Expected behavior

Inference with batch_size=2 should run to completion and produce output, as it does with batch_size=1.

actual behavior

TensorRT reports "Runtime dimension does not satisfy any optimization profile" and run.py aborts with RuntimeError: Executing TRT engine failed step=0!.

additional notes

None

DefTruth commented 7 months ago

Setting max_multimodal_len = max_batch_size * 576 (num_visual_features) at engine build time solves this problem.
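The arithmetic behind the fix can be sketched as follows. This is an illustrative check, not code from TensorRT-LLM: the 576 figure comes from the comment above (it matches LLaVA-1.5's CLIP ViT-L/14 encoder at 336px, which yields (336/14)² = 576 patches per image), and `required_max_multimodal_len` is a hypothetical helper name:

```python
# Each image in a LLaVA batch contributes a fixed number of visual features,
# all packed into one prompt-embedding table. The engine's optimization
# profile caps that table at max_multimodal_len, so the cap must be sized
# for the *whole batch* at build time, not for a single image.
NUM_VISUAL_FEATURES = (336 // 14) ** 2  # 576 patches per image

def required_max_multimodal_len(max_batch_size: int) -> int:
    """Smallest max_multimodal_len that fits max_batch_size images."""
    return max_batch_size * NUM_VISUAL_FEATURES

# An engine built for a single image (576) cannot fit batch_size=2 (1152),
# which is what trips the "does not satisfy any optimization profile" error.
print(required_max_multimodal_len(2))  # 1152
# For the max_batch_size=4 engine in this report:
print(required_max_multimodal_len(4))  # 2304
```

With the engine rebuilt using max_multimodal_len=2304, any batch size up to 4 stays within the optimization profile.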