jkim695 opened 2 months ago
Can you check your result before the plugin node?
I inserted output nodes for the inputs of the plugin, and found that one of them (MatMul_173) was producing inconsistent output:
When checking the inputs and outputs of this node, the inputs (query_layer and MatMul_423) are consistent across multiple runs, but the output is not.
Iteration 1:
query_layer: tensor([ 0.6668, -0.0100, 0.7874, ..., -0.0217, -0.1060, 0.1076])
MatMul_423: tensor([ 1.4557, -1.4124, -1.1299, ..., 1.1094, 1.0816, 1.1094])
Div_424: tensor([-0.8541, -1.2864, -0.9897, -1.1466, -1.5251, -1.0459])
Iteration 2:
query_layer: tensor([ 0.6668, -0.0100, 0.7874, ..., -0.0217, -0.1060, 0.1076])
MatMul_423: tensor([ 1.4557, -1.4124, -1.1299, ..., 1.1094, 1.0816, 1.1094])
Div_424: tensor([[-0.5895, -1.2377, -1.0294, -1.1611, -1.2238, -1.1667]])
Iteration 3:
query_layer: tensor([ 0.6668, -0.0100, 0.7874, ..., -0.0217, -0.1060, 0.1076])
MatMul_423: tensor([ 1.4557, -1.4124, -1.1299, ..., 1.1094, 1.0816, 1.1094])
Div_424: tensor([[-0.4970, -1.1129, -0.8866, -0.9455, -1.3411, -0.9709]])
I also observed that this node produced consistent output when running the model without the plugin:
Iteration 1:
query_layer: tensor([ 0.6680, -0.0111, 0.7866, ..., -0.0217, -0.1060, 0.1076])
MatMul_423: tensor([ 1.4551, -1.4111, -1.1299, ..., 1.1094, 1.0820, 1.1094])
Div_424: tensor([[-0.5854, -1.2207, -1.0205, -1.0791, -0.9033, -0.9243]])
Iteration 2:
query_layer: tensor([ 0.6680, -0.0111, 0.7866, ..., -0.0217, -0.1060, 0.1076])
MatMul_423: tensor([ 1.4551, -1.4111, -1.1299, ..., 1.1094, 1.0820, 1.1094])
Div_424: tensor([[-0.5854, -1.2207, -1.0205, -1.0791, -0.9033, -0.9243]])
Iteration 3:
query_layer: tensor([ 0.6680, -0.0111, 0.7866, ..., -0.0217, -0.1060, 0.1076])
MatMul_423: tensor([ 1.4551, -1.4111, -1.1299, ..., 1.1094, 1.0820, 1.1094])
Div_424: tensor([[-0.5854, -1.2207, -1.0205, -1.0791, -0.9033, -0.9243]])
Is there an explanation for why inserting the plugin node causes inconsistencies in this MatMul node?
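As a sanity check, the run-to-run drift shown above can be quantified with a small standard-library script; the values below are abbreviated from the Div_424 logs in this comment:

```python
# Compare the same intermediate tensor across two runs and flag nondeterminism.
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two runs."""
    return max(abs(x - y) for x, y in zip(a, b))

# First three Div_424 values from iterations 1 and 2 above.
run1 = [-0.8541, -1.2864, -0.9897]
run2 = [-0.5895, -1.2377, -1.0294]

tol = 1e-4  # anything above float rounding noise counts as inconsistent
print(max_abs_diff(run1, run2) > tol)  # prints True for these runs
```

The observed difference (~0.26) is orders of magnitude larger than expected floating-point noise, which is why this looks like a correctness bug rather than accumulation-order jitter.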
Description
When running inference with TensorRT's disentangled attention plugin on Microsoft's implementation of DeBERTa, I noticed that I get inconsistent output when running with dynamic sequence lengths in my inputs. This can be reproduced by using various input sequence lengths that are less than the max set for the optimization profile in the created TensorRT engine.
I get consistent output when running inference on the model without the plugin, using the same script.
These results may suggest a bug in either the optimization profile handling or the disentangled attention plugin.
Environment
TensorRT Version: 10.2
NVIDIA GPU: Tesla V100
NVIDIA Driver Version: 535.171.04
CUDA Version: 12.4
CUDNN Version: N/A
Operating System: ubuntu 20.04
Python Version (if applicable): 3.8.1
Tensorflow Version (if applicable): N/A
PyTorch Version (if applicable): 1.11
Baremetal or Container (if so, version): N/A
Relevant Files
Model link: https://huggingface.co/microsoft/deberta-v3-xsmall, pulled from transformers repository version 4.22.0
Steps To Reproduce
Have you tried the latest release?: Yes
Can this model run on other frameworks? For example, run the ONNX model with ONNX Runtime (polygraphy run <model.onnx> --onnxrt): Yes, I've tested it with ONNX Runtime (cuDNN 8.9, CUDA 11.8) and still had inconsistent output when running with the plugin.
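For a side-by-side repro, ONNX Runtime and TensorRT outputs can be compared in a single Polygraphy invocation. The input tensor name and shape bounds below are assumptions for illustration and should be replaced with the model's real input name and optimization-profile limits:

```shell
# Hypothetical comparison run; substitute the actual input name and
# the min/opt/max sequence lengths used when building the engine.
polygraphy run model.onnx --onnxrt --trt \
    --trt-min-shapes input_ids:[1,1] \
    --trt-opt-shapes input_ids:[1,128] \
    --trt-max-shapes input_ids:[1,512]
```

Running this twice with the same sequence length (below the max) and diffing the outputs would isolate whether the nondeterminism comes from the TensorRT path alone.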