Tensorrt model gives wrong output

Description

Hi, I tried to export a onnx format ber-base-chinese model to tensorrt engine with tensorrt version8.6.1. During the whole process the output log shows no error message, but when inferring, the tensorrt model always gives wrong predictions.

Environment

TensorRT Version:8.6.1

NVIDIA GPU:RTX 3060 Laptop

CUDA Version: 11.8

CUDNN Version:8700

Operating System: windows 11

Python Version (if applicable): python 3.10.13

Pytorch Version: 2.0.1

Baremetal or Container (if so, version): none

Steps To Reproduce

First I use optimum-cli command to export the huggingface bert-base-chinese model to onnx format with dynamic input batches:

optimum-cli export onnx --model bert-base-chinese --task fill-mask  ../onnx_export_bert_chinese

Here's the onnx model that provides correct answers (exactly the same predictions and probs as the original pytorch bert-base-chinese model gives) when inferring, if anyone needs it: https://drive.google.com/drive/folders/1whFFgmQ5IP_crFlxxbGsbq8QhTpW3zyc?usp=sharing

Then I use the following code to export the onnx model to tensorrt engine format

import tensorrt as trt
import torch

saved_trt_engine_path = 'models/bert_trt.engine' ###local trt engine path
onnx_model_path = 'models/onnx_export_bert_chinese/model.onnx' ### local onnx model path

logger = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(logger, '')

builder = trt.Builder(logger)

network = builder.create_network(1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

parser = trt.OnnxParser(network, logger)
parser.parse_from_file(onnx_model_path)    
config = builder.create_builder_config()
profile = builder.create_optimization_profile()

profile.set_shape(input="input_ids", min=(1, 128), opt=(1, 128), max=(1, 128))
profile.set_shape(input="attention_mask", min=(1, 128), opt=(1, 128), max=(1, 128))
profile.set_shape(input="token_type_ids", min=(1, 128), opt=(1, 128), max=(1, 128))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open(saved_trt_engine_path, 'wb') as f:
    f.write(serialized_engine)

The whole process takes about 1 minute with no error logs. Here's the exported tensorrt engine file: https://drive.google.com/file/d/16CNgLcNlwJfuEbqvgA4U4yL_voAFgxIl/view?usp=sharing Finally use the following code to do infer with the tensorrt engine:

cuda.init()
saved_trt_engine_path = 'models/bert_trt.engine' ###local trt engine path

logger = trt.Logger(trt.Logger.VERBOSE)
runtime = trt.Runtime(logger)

with open(saved_trt_engine_path, 'rb') as f:
    serialized_engine = f.read()
    engine = runtime.deserialize_cuda_engine(serialized_engine)
context = engine.create_execution_context()

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text = "[MASK]是一个测试"

encoded_text = tokenizer.encode_plus(text, max_length=128, truncation=True, padding='max_length', return_tensors="np")
input_ids = encoded_text["input_ids"]
token_type_ids = encoded_text["token_type_ids"]
attention_mask = encoded_text["attention_mask"]
output = torch.zeros(1, 128, 21128).cpu().detach().numpy()

### allocate cuda memory; in the test, we assume the input batch_size is 1
batch_size = 1
d_input_ids = cuda.mem_alloc(batch_size * input_ids.nbytes)  
d_token_type_ids = cuda.mem_alloc(batch_size * token_type_ids.nbytes)
d_attention_mask = cuda.mem_alloc(batch_size * attention_mask.nbytes)
d_output = cuda.mem_alloc(batch_size * output.nbytes)

stream = cuda.Stream()
bindings = [int(d_input_ids), int(d_token_type_ids), int(d_attention_mask), int(d_output)]

def do_inference_v2(context, bindings, stream):
    cuda.memcpy_htod_async(d_input_ids, input_ids, stream)     
    cuda.memcpy_htod_async(d_token_type_ids, token_type_ids, stream)
    cuda.memcpy_htod_async(d_attention_mask, attention_mask, stream)
    context.execute_async_v2(bindings, stream.handle, None)
    cuda.memcpy_dtoh_async(output, d_output, stream)  
    stream.synchronize()
    return output

trt_output = do_inference_v2(context, bindings=bindings, stream=stream)     

mask_token_index = input_ids[0].tolist().index(tokenizer.mask_token_id)
###Get the output vector for [MASK] token
mask_token_output = trt_output[0, mask_token_index, :]
print(mask_token_output) 
predicted_token_index = mask_token_output.argmax()
print(predicted_token_index)
### Get the corresponding Chinese character for the predicted index
predicted_chinese_character = tokenizer.convert_ids_to_tokens([predicted_token_index])[0]
### Replace [MASK] with the predicted Chinese character in the input sentence
updated_sentence = text.replace("[MASK]", predicted_chinese_character)
print("Inferred Sentence:", updated_sentence)

No matter what the input sensors are, the model always gives wrongs prediction tokens. I also tried to export onnx model and tensorrt engine into fp32 precision, and the outputs are samely wrong.

Have you tried the latest release?: Yes. I also tried on a previous version(8.5.1.7) and it gives the same outputs.

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): I inferred with the onnx model(the model in the code above:"models/onnx_export_bert_chinese/model.onnx") several times, and it always gives correct answer (exactly the same predictions and probs as the original pytorch bert-base-chinese model gives). So it doesn't seem like there's any error with the onnx model. I also tried using the trtexec command(with the same params as the code above) to convert onnx model to tensorrt engine. It gives the same wrong predictions.

Any help is appreciated. >-<

NVIDIA / TensorRT