NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Initialization failure of TensorRT 8.5.1.7 when running bcdu model on GPU A5000 #4117

Open d5423197 opened 1 month ago

d5423197 commented 1 month ago

Description

I tried to run an ONNX model through onnxruntime with the TensorrtExecutionProvider, but initialization failed.

Error msg:

2024-09-09 10:58:29.082851313 [E:onnxruntime:Default, tensorrt_execution_provider.h:58 log] [2024-09-09 02:58:29 ERROR] [concatenationLayer.cpp::estimateOutputDims::110] Error Code 4: Internal Error ((Unnamed Layer* 73) [Concatenation]: all concat input tensors must have the same dimensions except on the concatenation axis (1), but dimensions mismatched at index 0. Input 0 shape: [2,64,64,256], Input 1 shape: [0,64,64,256])

Environment

TensorRT Version: TensorRT 8.5.1.7

NVIDIA GPU: A5000

NVIDIA Driver Version: 11.4

CUDA Version: 11.4

CUDNN Version:

Operating System:

Python Version (if applicable): 3.8.0

Tensorflow Version (if applicable): 2.8.0

PyTorch Version (if applicable): N/A

Baremetal or Container (if so, version): N/A

Relevant Files

Model link: https://github.com/rezazad68/BCDU-Net/blob/master/Lung%20Segmentation/models.py

Steps To Reproduce

  1. Create the TF model
  2. Convert it using tf2onnx
  3. Initialize it using onnxruntime with the TensorrtExecutionProvider backend (see the sketch below)
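
For reference, step 3 amounts to something like the following minimal sketch; "model.onnx" is a placeholder path for the exported BCDU-Net model, not a file named in this thread:

import onnxruntime as ort

# Create an ORT session with the TensorRT EP first, falling back to CUDA and CPU.
# TensorRT engine building happens during session creation, which is where the
# concat error quoted above is raised.
providers = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("model.onnx", providers=providers)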
d5423197 commented 1 month ago

By the way, I have confirmed this issue is related to the ConvLSTM2D layer: if I build the model only up to the point where ConvLSTM2D would be added, it initializes successfully, but as soon as I add the ConvLSTM2D layer, initialization fails.
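
A hypothetical single-layer repro of that test would look like the sketch below; the shapes are illustrative (loosely echoing the [2,64,64,256] tensors in the error message), not taken from the actual model:

import tensorflow as tf

# Hypothetical minimal model containing only a ConvLSTM2D layer.
inp = tf.keras.Input(shape=(2, 64, 64, 256), batch_size=1)  # (time, H, W, C)
out = tf.keras.layers.ConvLSTM2D(filters=64, kernel_size=3, padding="same")(inp)
model = tf.keras.Model(inp, out)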

d5423197 commented 1 month ago
import tensorflow as tf
import tf2onnx
import models as M

input_shape = (256, 256, 3)  # H, W, C expected by the model
out_path = "bcdu_d3.onnx"    # example output filename
model = M.BCDU_net_D3(input_size=input_shape, traning=False)  # 'traning' [sic] matches models.py
spec = (tf.TensorSpec((1, 256, 256, 3), tf.float32, name="input"),)
model_proto, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13, output_path=out_path)

The code above is what I used to export the ONNX model. onnxruntime version: onnxruntime-gpu==1.12.0

moraxu commented 1 month ago

Thanks for the updated ticket info. Could you mention your OS version, just for reference? Also, have you tried running the ONNX model with https://github.com/NVIDIA/TensorRT/tree/main/samples/trtexec rather than ORT?

d5423197 commented 1 month ago

Hi @moraxu ,

No, I have not tried trtexec; I am a Python user.

OS version: Ubuntu 20.04

moraxu commented 1 month ago

Oh, it's just the executable that's called like that, it can be run on Linux. As was mentioned in https://github.com/NVIDIA/TensorRT/issues/4109#issuecomment-2335112830, we'd like to be sure the issue can be isolated to TRT itself, rather than ORT. Do you have access to the instructions here: https://github.com/NVIDIA/TensorRT/tree/main/samples/trtexec ?

Can you run it like this on your model to confirm the issue persists: ./trtexec --onnx=model.onnx ? I'll file an internal bug then.

d5423197 commented 1 month ago

@moraxu I installed TensorRT using pip (following the instructions in the official README). I tried building the engine using TensorRT alone and got the same error. Please check it.

Do you mean the pip version of TensorRT is different from the executable trtexec?

moraxu commented 1 month ago

Thanks, to clarify, trtexec is a standalone binary tool included with the TRT SDK (typically available when you install TRT using the tar or deb packages from NVIDIA). It helps with quick model conversion and testing, but it's separate from the pip version.

The version of TRT installed via pip should be the same as the version of trtexec, assuming they're from the same release, so the issue might be with TRT itself.
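
For reference, the pip wheel's version can be checked directly from Python and compared against the version banner trtexec prints when it runs:

import tensorrt as trt
print(trt.__version__)  # e.g. "8.5.1.7"; should match the "TensorRT v…" banner from trtexec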

"I tried to build it using only tensorrt."

Could you paste the full Python snippet here, on how you invoke the builder etc.? Apologies for the questions, I'd need that to file the bug.

d5423197 commented 1 month ago
import engine as eng  # local helper module with build_engine/save_engine (not shown here)
from onnx import ModelProto
import tensorrt as trt

engine_name = "test_cseg"
onnx_path = "weights.120-0.12_fix_sim.onnx"
batch_size = 1

# Read the three non-batch input dimensions from the ONNX graph.
model = ModelProto()
with open(onnx_path, "rb") as f:
    model.ParseFromString(f.read())

d0 = model.graph.input[0].type.tensor_type.shape.dim[1].dim_value
d1 = model.graph.input[0].type.tensor_type.shape.dim[2].dim_value
d2 = model.graph.input[0].type.tensor_type.shape.dim[3].dim_value
shape = [batch_size, d0, d1, d2]

# Build and save the engine; this fails with the same concat error as ORT.
engine = eng.build_engine(onnx_path, shape=shape)
eng.save_engine(engine, engine_name)

@moraxu
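
`engine` above is the user's local helper module. Purely for context, a minimal sketch of what build_engine and save_engine might look like with the TensorRT 8.5 Python API; the bodies below are an assumption, not the user's actual helper:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, shape):
    # Parse the ONNX model and build a serialized engine (TensorRT 8.x API).
    builder = trt.Builder(TRT_LOGGER)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parse failed")
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
    # `shape` would be used here to set an optimization profile for dynamic inputs.
    return builder.build_serialized_network(network, config)

def save_engine(serialized_engine, name):
    # Write the serialized engine to disk as a .plan file.
    with open(name + ".plan", "wb") as f:
        f.write(serialized_engine)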

moraxu commented 1 month ago

Thank you, I've filed an internal bug; will let you know if more info is needed.

moraxu commented 1 month ago

@d5423197 I was asked if you can try to run the model with a newer 10.x TRT version?

d5423197 commented 1 month ago

This is a very obvious problem. This bug is related to the TensorFlow ConvLSTM2D layer. Don't they know whether they have made this layer compatible? @moraxu

moraxu commented 1 month ago

@d5423197 but are you able to run this with a newer 10.x TRT version, or are you strictly limited to 8.5.1.7?

d5423197 commented 1 month ago

@moraxu For now, I am strictly limited to 8.5.1.7.

moraxu commented 2 weeks ago

I see. The issue has been fixed in the upcoming 10.6 release, though.

d5423197 commented 2 weeks ago

@moraxu Thanks, may I ask about the specific cause of this problem?

moraxu commented 2 weeks ago

A small issue in our vectorizer within our backend graph compiler.