@symphonylyh, can you re-assign this issue to someone on the team, please?
Later I tried again: with batch sizes from 1 to 6, it runs normally and produces results. But when the batch size is greater than 6, the error occurs. Is there any parameter that has not been set properly?
root@f0d49b6037c6:/workspace/examples/multimodal# python run.py --max_new_tokens 50 --batch_size 6 --input_text "Question: 图里面有什么? Answer:" --hf_model_dir /workspace/examples/multimodal/llava-v1.5-7b --visual_engine_dir visual_engines/llava-v1.5-7b --llm_engine_dir trt_engines/llava-v1.5-7b/fp16/1-gpu --decoder_llm
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:20<00:00, 10.39s/it]
[01/23/2024-06:59:16] [TRT-LLM] [I] Loading engine from visual_engines/llava-v1.5-7b/visual_encoder_fp16.engine
[01/23/2024-06:59:17] [TRT-LLM] [I] Creating session from engine visual_engines/llava-v1.5-7b/visual_encoder_fp16.engine
[01/23/2024-06:59:17] [TRT] [I] Loaded engine size: 599 MiB
[01/23/2024-06:59:17] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +595, now: CPU 0, GPU 595 (MiB)
[01/23/2024-06:59:17] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +396, now: CPU 0, GPU 991 (MiB)
[01/23/2024-06:59:29] [TRT] [I] Loaded engine size: 12855 MiB
[01/23/2024-06:59:32] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13055, GPU 27837 (MiB)
[01/23/2024-06:59:32] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 13057, GPU 27847 (MiB)
[01/23/2024-06:59:32] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[01/23/2024-06:59:32] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12853, now: CPU 0, GPU 13844 (MiB)
[01/23/2024-06:59:32] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13076, GPU 28179 (MiB)
[01/23/2024-06:59:32] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 13077, GPU 28187 (MiB)
[01/23/2024-06:59:32] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[01/23/2024-06:59:33] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13844 (MiB)
[01/23/2024-06:59:33] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13100, GPU 28205 (MiB)
[01/23/2024-06:59:33] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 13100, GPU 28215 (MiB)
[01/23/2024-06:59:33] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[01/23/2024-06:59:33] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13844 (MiB)
[01/23/2024-06:59:33] [TRT-LLM] [I] Load engine takes: 16.056230783462524 sec
[01/23/2024-06:59:35] [TRT] [W] Using default stream in enqueue()/enqueueV2()/enqueueV3() may lead to performance issues due to additional cudaDeviceSynchronize() calls by TensorRT to ensure correct synchronizations. Please use non-default stream instead.
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:165: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
[01/23/2024-07:01:12] [TRT-LLM] [I] ---------------------------------------------------------
[01/23/2024-07:01:12] [TRT-LLM] [I]
[Q] Question: 图里面有什么? Answer:
[01/23/2024-07:01:12] [TRT-LLM] [I]
[A] [['图里面有一个巨大的喷泉,喷泉水从高处喷出,周围是一座建筑,还有一个巨大'], ['图里面有一个巨大的喷泉,喷泉水从高处喷出,周围是一座建筑,还有一个巨大'], ['图里面有一个巨大的喷泉,喷泉水从高处喷出,周围是一座建筑,还有一个巨大'], ['图里面有一个巨大的喷泉,喷泉水从高处喷出,周围是一座建筑,还有一个巨大'], ['图里面有一个巨大的喷泉,喷泉水从高处喷出,周围是一座建筑,还有一个巨大'], ['图里面有一个巨大的喷泉,喷泉水从高处喷出,周围是一座建筑,还有一个巨大']]
[01/23/2024-07:01:12] [TRT-LLM] [I] TensorRT-LLM LLM latency: 0.8995234870910644 sec
[01/23/2024-07:01:12] [TRT-LLM] [I] ---------------------------------------------------------
Who can help me?
Hi @jkl375, thanks for providing the detailed reproducer, and sorry for the delayed response.
From your post, I have two action items planned: (1) build_visual_engine.py should take a command-line argument for the max batch size; (2) reproduce the bs > 6 failure on our end. Currently it appears to pass the visual encoder part and fail at the LLM part.
One request: can you please edit the description above and briefly list the modifications you made to run.py? That would make it clearer for us to understand your changes.
OK. Here are my modifications:

After pre_prompt, post_prompt = setup_llava_prompt(args.input_text), I added:

pre_prompt = [pre_prompt] * args.batch_size
post_prompt = [post_prompt] * args.batch_size

After image = image_processor(image, return_tensors='pt')['pixel_values'], I added:

image = image.expand(args.batch_size, -1, -1, -1).contiguous()

After profiler.stop("LLM"), I added:

input_lengths = input_lengths.repeat(32, 1)
Hi @jkl375, I tested up to batch size 32 and could run the example without error.
Your changes to the pre_prompt, post_prompt and image dimensions are correct. But input_lengths should be repeated batch_size times before the call to setup_fake_prompts(). In summary, you should replace

input_atts = torch.ones((1, length)).to(torch.int32).to("cuda")
input_lengths = torch.sum(input_atts, dim=1)

with

input_lengths = torch.IntTensor([length] * args.batch_size).to(torch.int32).to("cuda")

Make sure to remove your last change (input_lengths = input_lengths.repeat(32, 1)).
Let us know if you still face errors.
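For reference, here is a minimal sketch of the batched setup with all of the above applied. The variable names (length, image_processor, setup_llava_prompt, setup_fake_prompts) are taken from this thread and assumed to match the example's run.py; treat this as a fragment of that script, not a standalone program.

# Batched LLaVA inputs, per the fixes above (sketch; surrounding run.py context assumed).
pre_prompt, post_prompt = setup_llava_prompt(args.input_text)
pre_prompt = [pre_prompt] * args.batch_size
post_prompt = [post_prompt] * args.batch_size

image = image_processor(image, return_tensors='pt')['pixel_values']
image = image.expand(args.batch_size, -1, -1, -1).contiguous()

# One length per batch element, replacing the old (1, length) attention-mask sum,
# set before setup_fake_prompts() is called:
input_lengths = torch.IntTensor([length] * args.batch_size).to(torch.int32).to("cuda")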
Hi @amukkara, I am experiencing exactly the same problem that @jkl375 describes. I have set up the prompts and lengths in an almost identical way to what you have both described, but I am unable to get a model inference with batch_size > 6. All values up to 6 work fine.
My error message is identical to @jkl375's. And again, the visual embedder works fine, tested up to a batch size of 64.
My engine building command is:
python ../llama/build.py \
--model_dir hf_models/llava-v1.5-13b \
--output_dir llava/8_batch/trt_engines/fp16/llava-v1.5-13b \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--max_batch_size 8 \
--max_prompt_embedding_table_size 4608 #576*8
Could this be a fault in the engine-building process rather than the inference runner? Since batch sizes up to 6 work correctly, the inference script itself seems fine. Running on an A100 GPU.
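One way to check whether the built engine itself caps the batch dimension is to read the build parameters recorded next to the engine. A minimal sketch, assuming the engine directory from the build command above and that the parameters live under builder_config (older releases) or build_config (newer trtllm-build engines) in config.json:

import json

# Hypothetical check of the engine's recorded build limits; the directory comes
# from the build command above, and the key names are version-dependent assumptions.
engine_dir = 'llava/8_batch/trt_engines/fp16/llava-v1.5-13b'
with open(f'{engine_dir}/config.json') as f:
    cfg = json.load(f)
build_cfg = cfg.get('builder_config') or cfg.get('build_config')
print('max_batch_size:', build_cfg.get('max_batch_size'))
print('max_prompt_embedding_table_size:', build_cfg.get('max_prompt_embedding_table_size'))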
Hi @fhudson96, can you test using the latest scripts? There have been several changes in the last few weeks, and it is possible an issue with the build process was fixed in the latest release.
Hi @amukkara. There is no "examples/llama/build.py" in the latest branch, and no max_multimodal_len in examples/llama/convert_checkpoint.py.
Hi @jkl375, the build process for the llama model has changed in a recent update. Please use the following commands to build llama, as explained in examples/llama/README.md:
python ../llama/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_batch_size 2 \
--max_input_len 2048 \
--max_output_len 512 \
--max_multimodal_len 1152 # 2 (max_batch_size) * 576 (num_visual_features)
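(Note: max_multimodal_len, like the older max_prompt_embedding_table_size, is just max_batch_size times the number of visual features per image, which is 576 for LLaVA-1.5. A quick sanity check of the sizes used in this thread:)

# max_multimodal_len sizing (values from this thread; 576 per the comment above).
num_visual_features = 576                    # LLaVA-1.5 visual tokens per image
assert 2 * num_visual_features == 1152       # max_batch_size 2, as built above
assert 8 * num_visual_features == 4608       # max_batch_size 8, as in the 13b build earlier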
When I try to run the following command:
python ../llama/convert_checkpoint.py \
--model_dir /workspace/examples/multimodal/llava-v1.5-7b \
--output_dir ./tllm_checkpoint \
--dtype float16
the following error occurs:
Traceback (most recent call last):
  File "/workspace/examples/multimodal/../llama/convert_checkpoint.py", line 1973, in <module>
    main()
  File "/workspace/examples/multimodal/../llama/convert_checkpoint.py", line 1700, in main
    'architecture': hf_config.architectures[0]
TypeError: 'NoneType' object is not subscriptable
I found that with transformers==4.36.1, the value of hf_config is:
LlamaConfig {
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pretraining_tp": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"transformers_version": "4.36.1",
"use_cache": true,
"vocab_size": 32000
}
when executing hf_config = LlavaConfig.from_pretrained(args.model_dir).text_config.
There is no architectures field in hf_config.
Sorry, I downloaded the wrong model. It should be llava-1.5-7b-hf. It works!
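(For anyone hitting this: the crash is an unguarded index on a missing field. A hypothetical guard around the failing line in convert_checkpoint.py would surface the real cause immediately:)

# Hypothetical guard for the lookup that crashed above; 'architectures' is absent
# when the original llava-v1.5-7b checkpoint is used instead of the HF-converted
# llava-1.5-7b-hf one.
architectures = getattr(hf_config, 'architectures', None)
if not architectures:
    raise ValueError("text_config has no 'architectures'; use the HF-converted "
                     "llava-1.5-7b-hf checkpoint, not the original llava-v1.5-7b.")
architecture = architectures[0]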
@jkl375, what is your environment? Mine:
Hi, can you run the llava-v1.5-7b model successfully?
System Info

Who can help?

@symphonylyh

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

1. Follow examples/llama/README.md (I want to use a batch size of 32).
2. Build the visual engine with examples/multimodal/build_visual_engine.py.
3. run.py:
import numpy as np
import requests
import tensorrt as trt
import torch
from PIL import Image
from transformers import (AutoConfig, AutoTokenizer,
                          Blip2ForConditionalGeneration, Blip2Processor)

import tensorrt_llm
import tensorrt_llm.profiler as profiler
from tensorrt_llm import logger
from tensorrt_llm._utils import torch_to_numpy
from tensorrt_llm.runtime import ModelRunner, Session, TensorInfo

sys.path.append(str(Path(__file__).parent.parent))
from enc_dec.run import TRTLLMEncDecModel

def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--max_new_tokens', type=int, default=30)
    parser.add_argument('--batch_size', type=int, default=1)
    parser.add_argument('--log_level', type=str, default='info')
    parser.add_argument('--visual_engine_dir', type=str, default=None,
                        help='Directory containing visual TRT engines')
    parser.add_argument('--llm_engine_dir', type=str, default=None,
                        help='Directory containing TRT-LLM engines')
    parser.add_argument('--hf_model_dir', type=str, default=None,
                        help="Directory containing tokenizer")
    parser.add_argument('--decoder_llm', action='store_true',
                        help='Whether LLM is decoder-only or an encoder-decoder variant?')
    parser.add_argument('--blip_encoder', action='store_true',
                        help='Whether visual encoder is a BLIP model')
    parser.add_argument('--input_text', type=str,
                        default='Question: which city is this? Answer:',
                        help='Text prompt to LLM')
    parser.add_argument('--num_beams', type=int,
                        help="Use beam search if num_beams >1", default=1)
    parser.add_argument('--top_k', type=int, default=1)

def trt_dtype_to_torch(dtype):
    if dtype == trt.float16:
        return torch.float16
    elif dtype == trt.float32:
        return torch.float32
    elif dtype == trt.int32:
        return torch.int32
    else:
        raise TypeError("%s is not supported" % dtype)

class MultiModalModel:

def setup_llava_prompt(query):
    # Import these here to avoid installing llava when running blip models only

def load_test_image():
    img_url = 'https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png'
    return Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

if __name__ == '__main__':
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    args = parse_arguments()
    tensorrt_llm.logger.set_level(args.log_level)
    runtime_rank = tensorrt_llm.mpi_rank()
python run.py \
    --max_new_tokens 50 \
    --batch_size 32 \
    --input_text "Question: 图里面有什么? Answer:" \
    --hf_model_dir /workspace/examples/multimodal/llava-v1.5-7b \
    --visual_engine_dir visual_engines/llava-v1.5-7b \
    --llm_engine_dir trt_engines/llava-v1.5-7b/fp16/1-gpu \
    --decoder_llm
root@f0d49b6037c6:/workspace/examples/multimodal# python run.py --max_new_tokens 50 --batch_size 32 --input_text "Question: 图里面有什么? Answer:" --hf_model_dir /workspace/examples/multimodal/llava-v1.5-7b --visual_engine_dir visual_engines/llava-v1.5-7b --llm_engine_dir trt_engines/llava-v1.5-7b/fp16/1-gpu --decoder_llm
Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00, 6.38s/it]
[01/22/2024-03:31:15] [TRT-LLM] [I] Loading engine from visual_engines/llava-v1.5-7b/visual_encoder_fp16.engine
[01/22/2024-03:31:15] [TRT-LLM] [I] Creating session from engine visual_engines/llava-v1.5-7b/visual_encoder_fp16.engine
[01/22/2024-03:31:15] [TRT] [I] Loaded engine size: 599 MiB
[01/22/2024-03:31:15] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +595, now: CPU 0, GPU 595 (MiB)
[01/22/2024-03:31:16] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +396, now: CPU 0, GPU 991 (MiB)
[01/22/2024-03:31:26] [TRT] [I] Loaded engine size: 12855 MiB
[01/22/2024-03:31:28] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13055, GPU 27837 (MiB)
[01/22/2024-03:31:28] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 13057, GPU 27847 (MiB)
[01/22/2024-03:31:28] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[01/22/2024-03:31:28] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12853, now: CPU 0, GPU 13844 (MiB)
[01/22/2024-03:31:28] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13076, GPU 28179 (MiB)
[01/22/2024-03:31:28] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 13077, GPU 28187 (MiB)
[01/22/2024-03:31:28] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[01/22/2024-03:31:28] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13844 (MiB)
[01/22/2024-03:31:28] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13100, GPU 28205 (MiB)
[01/22/2024-03:31:28] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 13100, GPU 28215 (MiB)
[01/22/2024-03:31:28] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[01/22/2024-03:31:28] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13844 (MiB)
[01/22/2024-03:31:28] [TRT-LLM] [I] Load engine takes: 12.56920576095581 sec
[01/22/2024-03:31:29] [TRT] [W] Using default stream in enqueue()/enqueueV2()/enqueueV3() may lead to performance issues due to additional cudaDeviceSynchronize() calls by TensorRT to ensure correct synchronizations. Please use non-default stream instead.
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:165: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
[01/22/2024-03:31:30] [TRT] [E] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
(the setInputShape error above is repeated 3 times)
[01/22/2024-03:31:30] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
(the resolveSlots error above is repeated dozens of times)
Traceback (most recent call last):
  File "/workspace/examples/multimodal/run.py", line 399, in <module>
    stripped_text = model.generate(pre_prompt, post_prompt, image,
  File "/workspace/examples/multimodal/run.py", line 159, in generate
    output_ids = self.model.generate(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 584, in generate
    outputs = self.session.decode(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 746, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2789, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2450, in decode_regular
    should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, logits, encoder_input_lengths = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2136, in handle_per_step
    raise RuntimeError(f"Executing TRT engine failed step={step}!")
RuntimeError: Executing TRT engine failed step=0!
root@f0d49b6037c6:/workspace/examples/multimodal#