NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

wrong results of TensorRT 10.0 when running on GPU Tesla T4 #3999

Open yflv-yanxia opened 4 months ago

yflv-yanxia commented 4 months ago

Description

The output of the TensorRT 10 model converted from ONNX is incorrect, while the output of the TensorRT 8.6 model is correct. The issue seems to be located in some fully connected layers in the TensorRT 10 model, where the error in the output suddenly becomes very large. The exact cause is unknown. Please help to resolve this issue.

Environment

TensorRT Version: TensorRT 10.0.1

NVIDIA GPU: Tesla T4

NVIDIA Driver Version: 450.36.06

CUDA Version: 11.0

CUDNN Version: 8.0.0

Operating System:

ONNX opset: 17

Relevant Files

Model link: https://drive.google.com/file/d/1QBbmtdaecWAHzqMdh10QVbdSjTWzleqo/view?usp=sharing

Steps To Reproduce

  1. Convert the ONNX model to a TensorRT 10 engine with:

 ./trtexec --onnx=./test.onnx --device=0 --saveEngine=./test.trtmodel --precisionConstraints=obey
lix19937 commented 4 months ago

Run the following command, then upload the log.

 ./trtexec --onnx=./test.onnx --device=0 --saveEngine=./test.trtmodel --verbose 2>&1 | tee build.log
yflv-yanxia commented 4 months ago

build.log This is the log @lix19937

lix19937 commented 4 months ago

From your log,

[07/15/2024-08:14:59] [E] Error[1]: [cudaResources.cpp::haveStreamOrderedAllocatorHelper::15] Error Code 1: Cuda Runtime (invalid argument)

This may indicate a problem, though it could also just be a warning; it needs to be checked.
Use the following command to compare the results with ONNX Runtime:

polygraphy run test.onnx --trt --onnxrt   
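
If it is easier to script, the same comparison can also be done through Polygraphy's Python API; a rough sketch using the same test.onnx:

from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import EngineFromNetwork, NetworkFromOnnxPath, TrtRunner
from polygraphy.comparator import Comparator

# Build an ONNX Runtime session and a TensorRT engine from the same model, then compare outputs.
build_onnxrt_session = SessionFromOnnx("test.onnx")
build_engine = EngineFromNetwork(NetworkFromOnnxPath("test.onnx"))

results = Comparator.run([OnnxrtRunner(build_onnxrt_session), TrtRunner(build_engine)])
assert bool(Comparator.compare_accuracy(results))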
yflv-yanxia commented 3 months ago

log_testonnx.txt Here are the results after running the above instructions. @lix19937

lix19937 commented 3 months ago

The results show a large difference.

Run the following command with both TRT 8.6 and TRT 10.0, then upload the two li.json files:


trtexec --onnx=model_sim.onnx --verbose   \
--dumpProfile --dumpLayerInfo --separateProfileRun \
--noDataTransfers --useCudaGraph --useSpinWait --profilingVerbosity=detailed  --exportLayerInfo=li.json
yflv-yanxia commented 3 months ago

li_10.json li_86.json Here are the results. @lix19937

lix19937 commented 3 months ago

The two engines chose some different tactics. By the way, can you provide the two build logs? Were they run on the same machine?

lix19937 commented 3 months ago

Also, you can try TensorRT v10.2.

yflv-yanxia commented 3 months ago

build.log This is the log @lix19937

The build log for TRT 10 was provided earlier. Here is the build log for TRT 8.6: build_86.log. The previously provided li_86.json was not obtained on the same machine; I have now regenerated it on the same machine: li_86(1).json @lix19937

lix19937 commented 3 months ago

You can try the following command:

 polygraphy run model_sim.onnx --trt --onnxrt \
     --trt-outputs mark all \
     --onnx-outputs mark all
yflv-yanxia commented 3 months ago

log_eff.txt Here are the results. @lix19937

yflv-yanxia commented 2 months ago

Hi, sorry to bother you, but is there any update on the solution? @lix19937

yflv-yanxia commented 2 months ago

I tried TensorRT 10.3, and it's the same error. If it's an issue with the elemwise layer, is there a plan to fix it?

yflv-yanxia commented 2 months ago

Do you have any suggestions for solving this problem? @lix19937

lix19937 commented 2 months ago

@yflv-yanxia
Have you compared your ONNX results with the PyTorch forward output?

If the ONNX forward result is correct, then you can try adding --noTF32 to the trtexec command, for example:
 trtexec --onnx=./test.onnx --device=0 --saveEngine=./test.trtmodel --noTF32

From your log, the /image_encoder/backbone/stages.3/op_list.5/main/inverted_conv/act/Mul_1_output_0 output is the first place where the max diff spikes, so you can split your model there to check (see the sketch after the metrics below).

[I]         Error Metrics: /image_encoder/backbone/stages.3/op_list.5/main/inverted_conv/act/Mul_1_output_0
[I]             Minimum Required Tolerance: elemwise error | [abs=141.75] OR [rel=3.9318e+06] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.00049691, std-dev=0.07922, var=0.0062759, median=0.00014859, min=0 at (0, 3, 50, 37), max=141.75 at (0, 459, 55, 55), avg-magnitude=0.00049691
[I]                 ---- Histogram ----
                    Bin Range    |  Num Elems | Visualization
                    (0   , 14.2) |    3211263 | ########################################
                    (14.2, 28.4) |          0 | 
                    (28.4, 42.5) |          0 | 
                    (42.5, 56.7) |          0 | 
                    (56.7, 70.9) |          0 | 
                    (70.9, 85.1) |          0 | 
                    (85.1, 99.2) |          0 | 
                    (99.2, 113 ) |          0 | 
                    (113 , 128 ) |          0 | 
                    (128 , 142 ) |          1 | 
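
One possible way to split the model at that tensor is onnx.utils.extract_model; a rough sketch (the graph input name "input" below is only a placeholder, substitute the real input name of test.onnx):

import onnx

# Cut the graph so it ends at the first tensor whose max diff spikes.
onnx.utils.extract_model(
    "test.onnx",
    "test_split.onnx",
    input_names=["input"],  # placeholder: use the actual graph input name
    output_names=["/image_encoder/backbone/stages.3/op_list.5/main/inverted_conv/act/Mul_1_output_0"],
)

The extracted test_split.onnx can then be compared against ONNX Runtime with the same polygraphy run command as above.
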
yflv-yanxia commented 2 months ago

We've identified that the output from the layer '/bm_decoder/output_hypernetworks_mlps.0/layers.0/Gemm' in the TensorRT model is incorrect, while the output from the same layer in the ONNX model is correct. The outputs of all the layers before this one are correct. Interestingly, if we remove the layers before this one, the output becomes correct. It seems like some of the preceding layers are affecting the output of this layer. We've narrowed down the issue to this layer. Could you help us resolve it? @lix19937

yflv-yanxia commented 1 month ago

Sorry to bother you again, but I wanted to ask if there's any further solution now? @lix19937

moraxu commented 1 month ago

We've identified that the output from the layer '/bm_decoder/output_hypernetworks_mlps.0/layers.0/Gemm' in the TensorRT model is incorrect, while the output from the same layer in the ONNX model is correct. The outputs of all the layers before this one are correct. Interestingly, if we remove the layers before this one, the output becomes correct. It seems like some of the preceding layers are affecting the output of this layer. We've narrowed down the issue to this layer. Could you help us resolve it?

I can file an internal bug for this; to reproduce, the environment is the same as specified in your first message, and this also occurs with TensorRT 10.3, correct?

yflv-yanxia commented 1 month ago

Yes, that's correct. @moraxu

moraxu commented 1 month ago

Sorry, but can you also specify your OS?

yflv-yanxia commented 1 month ago

Operating System: CentOS Linux release 7.8.2003 (Core) @moraxu

yuanyao-nv commented 1 month ago

Seems like the accuracy regression was fixed in TRT 10.5. Can you please give it a try? @yflv-yanxia

yflv-yanxia commented 1 month ago

Thanks! I tested using the latest TRT 10.5 on an A6000 GPU on another machine, and the results were correct. However, when using the latest TRT 10.5 trtexec to convert the ONNX model on this machine's T4 GPU, the following error occurred (see the attached image). Could you help me figure out what the issue might be? @yuanyao-nv

yuanyao-nv commented 1 month ago

@yflv-yanxia Is cuda updated on the T4 machine?

yflv-yanxia commented 3 weeks ago

Thank you very much. After upgrading the CUDA version, the FP32 model can now be converted and inferred on the T4 GPU. Next, I tried the FP16 model, but the inference results were incorrect. The ONNX model has some updates, and I’ve re-uploaded it—please see the attachment. onnx model We identified that the issue is with this layer: /bm_decoder/transformer/layers.0/cross_attn_token_to_image/Softmax. The results before this layer are still correct, but the output of this layer is wrong. We tried using the --layerPrecisions parameter to set this layer to FP32 during model conversion, but it didn’t work. @yuanyao-nv
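
For reference, the same per-layer constraint can also be expressed through the TensorRT Python builder API; a rough sketch (model_sim.onnx is a placeholder path and the substring match on the layer name is illustrative, so adjust both to your model):

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch (the only mode in TensorRT 10)
parser = trt.OnnxParser(network, logger)
with open("model_sim.onnx", "rb") as f:
    assert parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Make TensorRT honor per-layer precision settings instead of treating them as hints.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if "cross_attn_token_to_image/Softmax" in layer.name:
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

engine_bytes = builder.build_serialized_network(network, config)
with open("model_fp16.trtmodel", "wb") as f:
    f.write(engine_bytes)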

yflv-yanxia commented 2 weeks ago

Sorry to bother you again, but I wanted to ask if there's any solution now? @moraxu @yuanyao-nv

yuanyao-nv commented 2 weeks ago

@yflv-yanxia Sorry about the delayed response. Are you saying that in TRT 10.5 the accuracy problem has changed to a different layer in the model?

yflv-yanxia commented 2 weeks ago

No, in TensorRT versions before 10.5, there were accuracy issues even in FP32 mode. TensorRT 10.5 fixed the accuracy issues in FP32 mode, and now the FP32 model works correctly. However, the FP16 model still has accuracy issues. We've identified that the issue lies in this layer: /bm_decoder/transformer/layers.0/cross_attn_token_to_image/Softmax. We need your help to resolve this. @yuanyao-nv

yflv-yanxia commented 1 week ago

Sorry to bother you again, but I wanted to ask if there's any solution now? @moraxu @yuanyao-nv

yuanyao-nv commented 1 week ago

@yflv-yanxia I have created an internal bug for the FP16 Softmax issue as well. Will keep you updated. Thanks!

yuanyao-nv commented 1 week ago

Thank you very much. After upgrading the CUDA version, the FP32 model can now be converted and inferred on the T4 GPU. Next, I tried the FP16 model, but the inference results were incorrect. The ONNX model has some updates, and I’ve re-uploaded it—please see the attachment. onnx model We identified that the issue is with this layer: /bm_decoder/transformer/layers.0/cross_attn_token_to_image/Softmax. The results before this layer are still correct, but the output of this layer is wrong. We tried using the --layerPrecisions parameter to set this layer to FP32 during model conversion, but it didn’t work. @yuanyao-nv

@yflv-yanxia We have requested access to your model but didn't receive a reply. Can you please grant access?

galagam commented 6 days ago

Thank you very much. After upgrading the CUDA version, the FP32 model can now be converted and inferred on the T4 GPU. Next, I tried the FP16 model, but the inference results were incorrect. The ONNX model has some updates, and I’ve re-uploaded it—please see the attachment. onnx model We identified that the issue is with this layer: /bm_decoder/transformer/layers.0/cross_attn_token_to_image/Softmax. The results before this layer are still correct, but the output of this layer is wrong. We tried using the --layerPrecisions parameter to set this layer to FP32 during model conversion, but it didn’t work. @yuanyao-nv

@yflv-yanxia please grant read permissions for the updated ONNX model. Otherwise we would not be able to debug this.

yflv-yanxia commented 6 days ago

Sorry, I just saw your message. The link is accessible now. @yuanyao-nv @galagam

galagam commented 6 days ago

@yflv-yanxia What are the expected inputs for this model? Given random uniform input in [0,1], this network generates very high magnitudes, which naturally overflow in FP16. The attached screenshot shows a structure of conv output raised to the power of 3. After the second Mul, max value is ~70K (==FP16 inf).

I see some NaNs in the network well before the mentioned Softmax layer.
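
For context, the largest finite FP16 value is 65504, so magnitudes around 70K overflow to inf; a quick NumPy check:

import numpy as np

print(np.finfo(np.float16).max)  # 65504.0, the largest finite FP16 value
print(np.float16(70000.0))       # inf: ~70K overflows in FP16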

yflv-yanxia commented 5 days ago

The network input values range from 0 to 1. For such a network, does that mean FP16 precision inference is definitely not feasible? Do you have any suggestions? @galagam

galagam commented 4 days ago

FP16 accuracy is best when processing normalized data. You can read about how large magnitudes affect accuracy in the TensorRT developer guide.

As a general rule, if you can run a network successfully in FP16 using ONNX-RT, then TensorRT will be able to do the same using strongly-typed mode. To do that, you'll need to add explicit Cast nodes to/from FP16 around the subgraphs that should be computed in FP16.
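
A rough sketch of inserting one such Cast with onnx-graphsurgeon, assuming you know the tensor where the FP16 region should begin ("boundary_tensor" below is only a placeholder); the Cast back to FP32 at the end of the region is analogous:

import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model_sim.onnx"))

# Tensor where the FP16 region should begin; "boundary_tensor" is a placeholder name.
boundary = graph.tensors()["boundary_tensor"]
fp16_copy = gs.Variable(name=boundary.name + "_fp16", dtype=np.float16)

# Rewire every consumer of the boundary tensor to read the FP16 copy instead.
for consumer in list(boundary.outputs):
    for i, inp in enumerate(consumer.inputs):
        if inp is boundary:
            consumer.inputs[i] = fp16_copy

# Cast node that produces the FP16 copy.
graph.nodes.append(gs.Node(op="Cast", attrs={"to": int(onnx.TensorProto.FLOAT16)},
                           inputs=[boundary], outputs=[fp16_copy]))

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_with_casts.onnx")

The modified model can then be built in strongly-typed mode (recent trtexec versions expose a --stronglyTyped option) so that TensorRT keeps the FP16/FP32 boundaries from the graph.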

You can enable FP16 compute for subgraphs that process large-magnitude data by scaling the data down (e.g., multiplying by some scalar) to ensure the magnitude stays small.