NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Pytorch and TensorRT 3D Conv different result. #486

Closed Nioolek closed 3 years ago

Nioolek commented 4 years ago

Description

I converted a 3D conv model from PyTorch to ONNX to TensorRT. Everything seemed to work well. I then ran inference on the model in PyTorch, ONNX Runtime, and TensorRT. The inference results of PyTorch and ONNX Runtime are the same, but the results of ONNX Runtime and TensorRT are different. So I located the problem in the TRT engine.

What I have checked: input shape, ONNX model (inspected in Netron).
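
For reference, a minimal sketch of the PyTorch-vs-ONNX Runtime comparison described above. The stand-in model is an assumption: it only mimics the input/output shapes from the logs below, not the actual 3dcnn.onnx from this issue.

import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Stand-in 3D conv model (assumption); input (1, 3, 10, 128, 128) -> output (1, 2).
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(8, 2),
).eval()

x = torch.randn(1, 3, 10, 128, 128)
torch.onnx.export(model, x, "3dcnn.onnx")

with torch.no_grad():
    torch_out = model(x).numpy()

sess = ort.InferenceSession("3dcnn.onnx")
(onnx_out,) = sess.run(None, {sess.get_inputs()[0].name: x.numpy()})

# PyTorch and ONNX Runtime agree; the divergence appears only in the TRT engine.
print("max abs diff (PyTorch vs. ONNX Runtime):", np.abs(torch_out - onnx_out).max())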

Environment

TensorRT Version: 7.0.0.11
GPU Type: 2080Ti * 2
Nvidia Driver Version: 440.33.01
CUDA Version: 10.0
CUDNN Version: 7.6
Operating System + Version: Ubuntu 16.04
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.2.0
Baremetal or Container (if container which image + tag): TensorRT container built from source following the official instructions

Relevant Files

https://drive.google.com/open?id=1oZ550uIm-IzM0E4CpUc-rtlGdjzVlMSj

Steps To Reproduce

I used the script from https://github.com/rmccorm4/tensorrt-utils/blob/master/classification/imagenet/onnx_to_tensorrt.py to convert the ONNX model to TensorRT. Command:

python onnx_to_tensorrt.py --onnx 3dcnn.onnx -o 3dcnn_docker1.trt -b 1 -v --explicit-batch --gpu-fallback --calibration-batch-size 1
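
For context, the core of that conversion looks roughly like this with the TensorRT 7.x Python API (a sketch, not the actual script; the 1 GiB workspace size is an arbitrary choice):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_engine(onnx_path):
    # Explicit-batch network, as required by the ONNX parser in TRT 7.
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            return None
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB; larger may enable more tactics
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
    return builder.build_engine(network, config)

engine = build_engine("3dcnn.onnx")
with open("3dcnn_docker1.trt", "wb") as f:
    f.write(engine.serialize())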

Log:

2020-04-10 10:05:46 - __main__ - INFO - TRT_LOGGER Verbosity: Severity.INFO
2020-04-10 10:05:46 - __main__ - INFO - Setting BuilderFlag.GPU_FALLBACK
[TensorRT] WARNING: /workspace/TensorRT/parsers/onnx/onnx2trt_utils.cpp:216: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[TensorRT] WARNING: /workspace/TensorRT/parsers/onnx/onnx2trt_utils.cpp:216: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[TensorRT] WARNING: Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
2020-04-10 10:05:47 - __main__ - DEBUG - === Network Description ===
2020-04-10 10:05:47 - __main__ - DEBUG - Input  0 | Name: 0   | Shape: (1, 3, 10, 128, 128)
2020-04-10 10:05:47 - __main__ - DEBUG - Output 0 | Name: 177 | Shape: (-1, 2)
2020-04-10 10:05:47 - __main__ - DEBUG - === Optimization Profiles ===
2020-04-10 10:05:47 - __main__ - DEBUG - 0 - OptProfile 0 - Min (1, 3, 10, 128, 128) Opt (1, 3, 10, 128, 128) Max (1, 3, 10, 128, 128)
2020-04-10 10:05:47 - __main__ - INFO - Building Engine...
[TensorRT] WARNING: Setting layouts of network and plugin input/output tensors to linear, as 3D operators are found and 3D non-linear IO formats are not supported, yet.
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
2020-04-10 10:05:49 - __main__ - INFO - Serializing engine to file: 3dcnn_docker1.trt

The inference code is based on https://github.com/rmccorm4/tensorrt-utils/blob/master/classification/imagenet/infer_tensorrt_imagenet.py
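
That flow boils down to roughly the following (a sketch assuming a single fp32 input and output and the engine file from above; the buffer shapes are taken from the network description in the log):

import numpy as np
import pycuda.autoinit            # initializes the CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
with open("3dcnn_docker1.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
x = np.random.randn(1, 3, 10, 128, 128).astype(np.float32)
out = np.empty((1, 2), dtype=np.float32)

# Host -> device, execute, device -> host.
d_in = cuda.mem_alloc(x.nbytes)
d_out = cuda.mem_alloc(out.nbytes)
cuda.memcpy_htod(d_in, x)
context.execute_v2(bindings=[int(d_in), int(d_out)])
cuda.memcpy_dtoh(out, d_out)
print(out)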

Is this caused by the INT64 parameters?

XinnWang commented 4 years ago

@Nioolek I met the same problem. Have you solved it?

111qqz commented 4 years ago

Same problem here. I also tried converting a 3D conv model from PyTorch to Caffe to TensorRT; the TRT inference result is wrong.

ttyio commented 4 years ago

Hello @Nioolek , thanks for reporting. We released a new tool in 7.2 that compares a TRT run against other frameworks. Please check https://github.com/NVIDIA/TensorRT/tree/release/7.2/tools/Polygraphy

And for your case, you can run it from the command line:

polygraphy run 3dcnn.onnx --trt --onnxrt

I get this result:

[I] Runner: onnxrt-runner-N0-10/23/20-00:55:07 | Completed 1 iterations.
[I] Accuracy Comparison | trt-runner-N0-10/23/20-00:55:07 vs. onnxrt-runner-N0-10/23/20-00:55:07
[I]   Comparing Output: '177' (dtype=float32, shape=(1, 2)) with '177' (dtype=float32, shape=(1, 2))
[S]   PASSED | Difference is within tolerance (rtol=1e-05, atol=1e-05)
[S] PASSED | Command: /home/vincenth/.local/bin/polygraphy run 3dcnn.onnx --trt --onnxrt

The small mismatch could be introduced by the order of floating-point arithmetic. DNNs are by nature robust against small perturbations most of the time, which is why FP16/INT8 works. Have you compared end-to-end accuracy instead of the bit-level mismatch?
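
To make the suggested end-to-end check concrete, a small helper along these lines could be applied to the two output arrays (my sketch; compare_outputs is not a Polygraphy function):

import numpy as np

def compare_outputs(trt_out, ref_out):
    # Bit-level drift vs. end-to-end effect: does the prediction actually change?
    abs_diff = np.abs(trt_out - ref_out)
    rel_diff = abs_diff / (np.abs(ref_out) + 1e-12)
    print("max abs diff:", abs_diff.max(), "| max rel diff:", rel_diff.max())
    print("argmax agrees:", bool((trt_out.argmax(-1) == ref_out.argmax(-1)).all()))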

Nioolek commented 4 years ago

If I have time, I will run that test. The problem remains unsolved. We used libtorch to run the 3D CNN inference before, but that doesn't seem to be the best choice for an online inference environment.

ttyio commented 4 years ago

Thanks @Nioolek , do you see an accuracy loss in TRT with real data?

ttyio commented 4 years ago

Sorry @Nioolek @XinnWang @111qqz , I was using an internal nightly build in my previous comment; the issue does reproduce with the 7.2 release. We now have an internal tracker for this issue. Until it is fixed, here is a script that modifies your ONNX model to work around the problem; I have verified it with your model. Could you give it a try? Thanks!

import onnx

model = onnx.load('3dcnn.onnx')
graph = model.graph
nodes = graph.node
# Names of all constant tensors (initializers) in the graph.
initlist = [init.name for init in graph.initializer]
for node1 in nodes:
    if node1.op_type == "Add":
        # Find an Add input that is fed directly by an initializer.
        consInput = None
        if node1.input[0] in initlist:
            consInput = node1.input[0]
        if node1.input[1] in initlist:
            consInput = node1.input[1]
        if consInput:
            # Route the constant through an Identity node so the importer
            # sees a regular tensor instead of an initializer.
            idOutput0 = "ident_{}".format(consInput)
            nodeIdent = onnx.helper.make_node(
                'Identity',
                [consInput],  # inputs
                [idOutput0],  # outputs
            )
            node1.input.remove(consInput)
            node1.input.extend([idOutput0])
            nodes.extend([nodeIdent])
model_def = onnx.helper.make_model(graph)
onnx.save(model_def, './update_model.onnx')
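
If the patch applies cleanly, the Polygraphy comparison from the earlier comment can be rerun against the modified model to confirm the workaround:

polygraphy run update_model.onnx --trt --onnxrt
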
ttyio commented 3 years ago

I will close this; please reopen if you still have questions, thanks!

Nioolek commented 3 years ago

@XinnWang @111qqz I have not solved this problem. I used libtorch to run inference on the network instead.