NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

wrong results of TensorRT 10.0 when running on GPU Tesla T4 #3999

Open yflv-yanxia opened 1 month ago

yflv-yanxia commented 1 month ago

Description

The output of the TensorRT 10 model converted from ONNX is incorrect, while the output of the TensorRT 8.6 model is correct. The issue seems to be located in some fully connected layers in the TensorRT 10 model, where the error in the output suddenly becomes very large. The exact cause is unknown. Please help to resolve this issue.

Environment

TensorRT Version: TensorRT 10.0.1

NVIDIA GPU: Tesla T4

NVIDIA Driver Version: 450.36.06

CUDA Version: 11.0

CUDNN Version: 8.0.0

Operating System:

onnx opset17

Relevant Files

Model link: https://drive.google.com/file/d/1QBbmtdaecWAHzqMdh10QVbdSjTWzleqo/view?usp=sharing

Steps To Reproduce

  1. Convert the ONNX model to TensorRT 10 using ./trtexec --onnx=./test.onnx --device=0 --saveEngine=./test.trtmodel --precisionConstraints=obey.

lix19937 commented 1 month ago

Run the following command, then upload the log.

 ./trtexec --onnx=./test.onnx --device=0 --saveEngine=./test.trtmodel --verbose 2>&1 | tee build.log

yflv-yanxia commented 1 month ago

build.log This is the log @lix19937

lix19937 commented 1 month ago

From your log,

[07/15/2024-08:14:59] [E] Error[1]: [cudaResources.cpp::haveStreamOrderedAllocatorHelper::15] Error Code 1: Cuda Runtime (invalid argument)

may indicate a problem, though it could be only a warning; it needs checking.

Use the following command to compare the results with ONNX Runtime:

polygraphy run test.onnx --trt --onnxrt

yflv-yanxia commented 1 month ago

log_testonnx.txt Here are the results after running the above instructions. @lix19937

lix19937 commented 1 month ago

The results differ significantly.
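For context on what a "diff" means here: Polygraphy compares the outputs elementwise against an absolute and a relative tolerance (both 1e-05 in the log quoted later in this thread). The snippet below is only a minimal sketch of such a check, assuming a NumPy-allclose-style rule; the function name `exceeds_tolerance` is hypothetical, and the exact rule Polygraphy applies may differ.

```python
import math


def exceeds_tolerance(trt_vals, ref_vals, atol=1e-5, rtol=1e-5):
    """Return the indices where |trt - ref| > atol + rtol * |ref|.

    Assumption: this mirrors numpy.allclose semantics, not necessarily
    the exact comparison implemented inside Polygraphy.
    """
    bad = []
    for i, (t, r) in enumerate(zip(trt_vals, ref_vals)):
        # Treat NaN as a mismatch so that it is never silently accepted.
        if math.isnan(t) or math.isnan(r) or abs(t - r) > atol + rtol * abs(r):
            bad.append(i)
    return bad


if __name__ == "__main__":
    # Made-up values for illustration only, not taken from the model.
    trt_out = [1.0, 2.0, 52399.0]
    ref_out = [1.0, 2.000001, 0.5]
    print(exceeds_tolerance(trt_out, ref_out))
```

A tiny rounding difference passes, while an output that is off by orders of magnitude is flagged, which is the kind of failure reported in this issue.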

Run the following command with both TRT 8.6 and TRT 10.0, then upload the two li.json files:

trtexec --onnx=model_sim.onnx --verbose   \
--dumpProfile --dumpLayerInfo --separateProfileRun \
--noDataTransfers --useCudaGraph --useSpinWait --profilingVerbosity=detailed  --exportLayerInfo=li.json

yflv-yanxia commented 1 month ago

li_10.json li_86.json Here are the results. @lix19937

lix19937 commented 1 month ago

The two versions choose some different tactics. By the way, can you provide both build logs? Were they run on the same machine?
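A rough way to list where two layer-info files disagree is sketched below. It assumes each exported JSON holds a "Layers" list of objects carrying a layer name and a tactic identifier; the key names `Name` and `TacticValue`, and the helpers `load_layers` and `diff_tactics`, are assumptions for illustration and may need adjusting to the actual files. Layer names can also differ between versions because of different fusions, so unmatched layers are reported separately.

```python
import json


def load_layers(path):
    """Load the layer list from an exported layer-info JSON file.

    Assumption: the file is either {"Layers": [...]} or a bare list.
    """
    with open(path) as f:
        data = json.load(f)
    return data["Layers"] if isinstance(data, dict) else data


def diff_tactics(layers_a, layers_b, name_key="Name", tactic_key="TacticValue"):
    """Compare tactic identifiers of layers that share a name."""
    a = {l[name_key]: l.get(tactic_key) for l in layers_a if isinstance(l, dict)}
    b = {l[name_key]: l.get(tactic_key) for l in layers_b if isinstance(l, dict)}
    # Layers present in both files whose tactic differs.
    changed = {n: (a[n], b[n]) for n in a.keys() & b.keys() if a[n] != b[n]}
    # Layers that exist in only one of the two files.
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    return changed, only_a, only_b


if __name__ == "__main__":
    # Tiny made-up inputs; real use would call load_layers() on both files.
    old = [{"Name": "conv1", "TacticValue": "0x1"}, {"Name": "fc1", "TacticValue": "0x2"}]
    new = [{"Name": "conv1", "TacticValue": "0x1"}, {"Name": "fc1", "TacticValue": "0x9"}]
    print(diff_tactics(old, new))
```

A differing tactic is not proof of a bug by itself; it only narrows down which layers are worth checking against the per-layer output comparison.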

lix19937 commented 1 month ago

Also, you can try TensorRT v10.2.

yflv-yanxia commented 1 month ago

build.log This is the log @lix19937

The build log for TRT10 has been provided before. Below is the build log for TRT8.6. build_86.log The previously provided li_86.json was not obtained on the same machine. I have now obtained li_86.json on the same machine. li_86(1).json @lix19937

lix19937 commented 3 weeks ago

You can try the following command:

 polygraphy run model_sim.onnx --trt --onnxrt \
     --trt-outputs mark all \
     --onnx-outputs mark all

yflv-yanxia commented 3 weeks ago

log_eff.txt Here are the results. @lix19937

yflv-yanxia commented 5 days ago

Hi, sorry to bother you, but is there any update on the solution? @lix19937

lix19937 commented 4 days ago

Sorry for the late reply. You can try TRT v10.3.

BTW, from your log, we can find:

log_eff.txt Here are the results. @lix19937

[E]         FAILED | Output: '/image_encoder/backbone/stages.0/op_list.0/act/Mul_output_0' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[I]     Comparing Output: '/image_encoder/backbone/stages.0/op_list.0/act/Mul_1_output_0' (dtype=float32, shape=(1, 32, 448, 448)) with '/image_encoder/backbone/stages.0/op_list.0/act/Mul_1_output_0' (dtype=float32, shape=(1, 32, 448, 448))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-08/02/24-08:52:02: /image_encoder/backbone/stages.0/op_list.0/act/Mul_1_output_0 | Stats: mean=41.007, std-dev=1005.6, var=1.0113e+06, median=0.00070975, min=-62717 at (0, 9, 440, 0), max=52399 at (0, 19, 4, 247), avg-magnitude=256.25
[I]             ---- Histogram ----
                Bin Range              |  Num Elems | Visualization
                (-6.27e+04, -5.12e+04) |          9 | 
                (-5.12e+04, -3.97e+04) |         41 | 
                (-3.97e+04, -2.82e+04) |        171 | 
                (-2.82e+04, -1.67e+04) |        904 | 
                (-1.67e+04, -5.16e+03) |      18234 | 
                (-5.16e+03, 6.35e+03 ) |    6385873 | ########################################
                (6.35e+03 , 1.79e+04 ) |      16373 | 
                (1.79e+04 , 2.94e+04 ) |        817 | 
                (2.94e+04 , 4.09e+04 ) |         91 | 
                (4.09e+04 , 5.24e+04 ) |         15 | 

The outputs start to differ at /image_encoder/backbone/stages.0/op_list.0/act/Mul_output_0.
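With every output marked, the comparison log gets long. The sketch below pulls the failing output names out of the log text in order of appearance, assuming the "FAILED | Output: '...'" line format shown above; the helper name `failed_outputs` is hypothetical. Log order is not guaranteed to match the execution order of the network, so the first hit is only a starting point.

```python
import re

# Matches lines such as:
# [E]         FAILED | Output: '<tensor name>' | Difference exceeds tolerance (...)
FAILED_RE = re.compile(r"FAILED \| Output: '([^']+)'")


def failed_outputs(log_text):
    """Return failing output names in order of first appearance."""
    seen = []
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    sample = (
        "[E]         FAILED | Output: "
        "'/image_encoder/backbone/stages.0/op_list.0/act/Mul_output_0' "
        "| Difference exceeds tolerance (rel=1e-05, abs=1e-05)\n"
    )
    for name in failed_outputs(sample):
        print(name)
```

For a real log, read the file into a string (for example log_eff.txt) and pass it to `failed_outputs`.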