NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
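For context, the high-level Python API looks roughly like the sketch below, adapted from the project's quick-start documentation for recent releases; exact module paths and class locations may differ in older versions (such as the 0.10 release used in this report), and the model name is only an example.

    from tensorrt_llm import LLM, SamplingParams

    # Build (or load) an engine directly from a Hugging Face checkpoint.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Basic sampling settings for a quick smoke test.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    outputs = llm.generate(["What is TensorRT-LLM?"], sampling_params)
    for output in outputs:
        # Each result carries the prompt and the generated completions.
        print(output.outputs[0].text)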
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Error when generating TensorRT engines for the visual components of LLaVA models #2150

Open enamulhoque1 opened 3 weeks ago

enamulhoque1 commented 3 weeks ago

System Info

- GPU: NVIDIA A100-SXM4-80GB
- NVIDIA-SMI 535.183.01, Driver Version 535.183.01, CUDA Version 12.2

Who can help?

@byshiue @kaiyux

Information

Tasks

Reproduction

https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#llava-llava-next-and-vila

python /opt/tensorrtllm_backend/tensorrt_llm/examples/multimodal/build_visual_engine.py \
    --model_type llava \
    --model_path /models/${MODEL_NAME} \
    --max_batch_size 5 \
    --output_dir /models/llava/vit_trt_engines/${MODEL_NAME}

Expected behavior

The visual TensorRT engines should be generated under /models/llava/vit_trt_engines/${MODEL_NAME}.

Actual behavior

RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 3, 3, 336, 336]

Full logs below:

[TensorRT-LLM] TensorRT-LLM version: 0.10.0
/usr/local/lib/python3.10/dist-packages/transformers/models/llava/configuration_llava.py:100: FutureWarning: The vocab_size argument is deprecated and will be removed in v4.42, since it can be inferred from the text_config. Passing this argument has no effect
  warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████| 15/15 [00:08<00:00, 1.84it/s]
[08/23/2024-20:51:14] [TRT] [I] Exporting onnx
Traceback (most recent call last):
  File "/opt/tensorrtllm_backend/tensorrt_llm/examples/multimodal/build_visual_engine.py", line 576, in <module>
    builder.build()
  File "/opt/tensorrtllm_backend/tensorrt_llm/examples/multimodal/build_visual_engine.py", line 79, in build
    build_llava_engine(args)
  File "/opt/tensorrtllm_backend/tensorrt_llm/examples/multimodal/build_visual_engine.py", line 309, in build_llava_engine
    export_visual_wrapper_onnx(wrapper, image, args.output_dir)
  File "/opt/tensorrtllm_backend/tensorrt_llm/examples/multimodal/build_visual_engine.py", line 106, in export_visual_wrapper_onnx
    torch.onnx.export(visual_wrapper,
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 516, in export
    _export(
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 1613, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 1135, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 1011, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 915, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
  File "/usr/local/lib/python3.10/dist-packages/torch/jit/_trace.py", line 1296, in _get_trace_graph
    outs = ONNXTracedModule(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/jit/_trace.py", line 138, in forward
    graph, out = torch._C._create_graph_by_tracing(
  File "/usr/local/lib/python3.10/dist-packages/torch/jit/_trace.py", line 129, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/tensorrtllm_backend/tensorrt_llm/examples/multimodal/build_visual_engine.py", line 298, in forward
    all_hidden_states = self.tower(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py", line 926, in forward
    return self.vision_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py", line 850, in forward
    hidden_states = self.embeddings(pixel_values)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py", line 185, in forward
    patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 3, 3, 336, 336]

additional notes

None

amukkara commented 1 week ago

The model_type specified is incorrect. Changing it to --model_type llava_next should work.

Unrelated, but you seem to be using the older 0.10 version. I suggest installing the latest package from https://nvidia.github.io/TensorRT-LLM/installation/linux.html
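Assuming the reporter's paths and batch size, the corrected invocation would look like this (untested sketch; only the --model_type flag changes):

    python /opt/tensorrtllm_backend/tensorrt_llm/examples/multimodal/build_visual_engine.py \
        --model_type llava_next \
        --model_path /models/${MODEL_NAME} \
        --max_batch_size 5 \
        --output_dir /models/llava/vit_trt_engines/${MODEL_NAME}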