yflv-yanxia opened this issue 4 months ago
Run the following, then upload the log:
./trtexec --onnx=./test.onnx --device=0 --saveEngine=./test.trtmodel --verbose 2>&1 |tee build.log
build.log This is the log @lix19937
From your log, this line
[07/15/2024-08:14:59] [E] Error[1]: [cudaResources.cpp::haveStreamOrderedAllocatorHelper::15] Error Code 1: Cuda Runtime (invalid argument)
may indicate a problem, though it could also be just a warning; it needs to be checked.
Use the following command to compare the result with onnxruntime:
polygraphy run test.onnx --trt --onnxrt
log_testonnx.txt Here are the results after running the above instructions. @lix19937
The results show a large difference.
Use the following command with TRT 8.6 and TRT 10.0, and then upload the two li.json files:
trtexec --onnx=model_sim.onnx --verbose \
--dumpProfile --dumpLayerInfo --separateProfileRun \
--noDataTransfers --useCudaGraph --useSpinWait --profilingVerbosity=detailed --exportLayerInfo=li.json
li_10.json li_86.json Here are the results. @lix19937
They choose some different tactics. By the way, can you provide the two build logs? Were they run on the same machine?
Also, you can try TensorRT 10.2.
build.log This is the log @lix19937
The build log for TRT10 has been provided before. Below is the build log for TRT8.6. build_86.log The previously provided li_86.json was not obtained on the same machine. I have now obtained li_86.json on the same machine. li_86(1).json @lix19937
You can try the following command:
polygraphy run model_sim.onnx --trt --onnxrt \
--trt-outputs mark all \
--onnx-outputs mark all
log_eff.txt Here are the results. @lix19937
Hi, sorry to bother you, but is there any update on the solution? @lix19937
I tried TensorRT 10.3, and it's the same error. If it's an issue with the elemwise layer, is there a plan to fix it?
Do you have any suggestions for solving this problem? @lix19937
@yflv-yanxia
Did you compare your ONNX result with the torch forward output?
If the ONNX forward result is correct, you can try adding --noTF32
to the trtexec command, like the following:
trtexec --onnx=test.onnx --noTF32
From your log, /image_encoder/backbone/stages.3/op_list.5/main/inverted_conv/act/Mul_1_output_0
is the first output to show a large maximum difference; you can split your model at that point to check (see the extraction command after the log excerpt below).
[I] Error Metrics: /image_encoder/backbone/stages.3/op_list.5/main/inverted_conv/act/Mul_1_output_0
[I] Minimum Required Tolerance: elemwise error | [abs=141.75] OR [rel=3.9318e+06] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=0.00049691, std-dev=0.07922, var=0.0062759, median=0.00014859, min=0 at (0, 3, 50, 37), max=141.75 at (0, 459, 55, 55), avg-magnitude=0.00049691
[I] ---- Histogram ----
Bin Range | Num Elems | Visualization
(0 , 14.2) | 3211263 | ########################################
(14.2, 28.4) | 0 |
(28.4, 42.5) | 0 |
(42.5, 56.7) | 0 |
(56.7, 70.9) | 0 |
(70.9, 85.1) | 0 |
(85.1, 99.2) | 0 |
(99.2, 113 ) | 0 |
(113 , 128 ) | 0 |
(128 , 142 ) | 1 |
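One way to split the model at that tensor (a sketch, assuming polygraphy is available; submodel.onnx is a placeholder filename) is to extract a submodel ending at the flagged output and compare only that slice against onnxruntime:
polygraphy surgeon extract model_sim.onnx \
    --outputs /image_encoder/backbone/stages.3/op_list.5/main/inverted_conv/act/Mul_1_output_0:auto \
    -o submodel.onnx
polygraphy run submodel.onnx --trt --onnxrt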
We've identified that the output from the layer '/bm_decoder/output_hypernetworks_mlps.0/layers.0/Gemm' in the TensorRT model is incorrect, while the output from the same layer in the ONNX model is correct. The outputs of all the layers before this one are correct. Interestingly, if we remove the layers before this one, the output becomes correct. It seems like some of the preceding layers are affecting the output of this layer. We've narrowed down the issue to this layer. Could you help us resolve it? @lix19937
Sorry to bother you again, but I wanted to ask if there's any further solution now? @lix19937
I can file an internal bug for this; to reproduce, the environment is the same as specified in your first message, and this also occurs with TensorRT 10.3, correct?
Yes, that's correct. @moraxu
Sorry, but can you also specify your OS?
Operating System: CentOS Linux release 7.8.2003 (Core) @moraxu
Seems like the accuracy regression was fixed in TRT 10.5. Can you please give it a try? @yflv-yanxia
Thanks! I tested using the latest TRT 10.5 on an A6000 GPU on another machine, and the results were correct. However, when using the latest TRT 10.5 trtexec to convert the ONNX model on this machine's T4 GPU, the following error occurred. Could you help me figure out what the issue might be? @yuanyao-nv
@yflv-yanxia Is cuda updated on the T4 machine?
Thank you very much. After upgrading the CUDA version, the FP32 model can now be converted and inferred on the T4 GPU. Next, I tried the FP16 model, but the inference results were incorrect. The ONNX model has some updates, and I’ve re-uploaded it—please see the attachment. onnx model We identified that the issue is with this layer: /bm_decoder/transformer/layers.0/cross_attn_token_to_image/Softmax. The results before this layer are still correct, but the output of this layer is wrong. We tried using the --layerPrecisions parameter to set this layer to FP32 during model conversion, but it didn’t work. @yuanyao-nv
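For reference, --layerPrecisions normally needs to be paired with --precisionConstraints; a sketch of such an invocation is below (the ONNX filename and engine name are placeholders, and the layer name must match the name of the layer in the built engine, which can differ from the ONNX node name after fusion):
trtexec --onnx=model.onnx --fp16 \
    --precisionConstraints=obey \
    --layerPrecisions=/bm_decoder/transformer/layers.0/cross_attn_token_to_image/Softmax:fp32 \
    --saveEngine=test_fp16.trtmodel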
Sorry to bother you again, but I wanted to ask if there's any solution now? @moraxu @yuanyao-nv
@yflv-yanxia Sorry about the delayed response. Are you saying that in TRT 10.5 the accuracy problem has changed to a different layer in the model?
No, in TensorRT versions before 10.5, there were accuracy issues even in FP32 mode. TensorRT 10.5 fixed the accuracy issues in FP32 mode, and now the FP32 model works correctly. However, the FP16 model still has accuracy issues. We've identified that the issue lies in this layer: /bm_decoder/transformer/layers.0/cross_attn_token_to_image/Softmax. We need your help to resolve this. @yuanyao-nv
Sorry to bother you again, but I wanted to ask if there's any solution now? @moraxu @yuanyao-nv
@yflv-yanxia I have created an internal bug for the FP16 Softmax issue as well. Will keep you updated. Thanks!
@yflv-yanxia We have requested access to your model but haven't received a reply. Can you please grant access?
@yflv-yanxia please grant read permissions for the updated ONNX model. Otherwise we would not be able to debug this.
Sorry, I just saw your message. The link is accessible now. @yuanyao-nv @galagam
@yflv-yanxia What are the expected inputs for this model? Given random uniform input in [0,1], this network generates very high magnitudes, which naturally overflow in FP16. The attached screenshot shows a structure where a conv output is raised to the power of 3. After the second Mul, the max value is ~70K, which exceeds the FP16 maximum (~65504) and overflows to inf.
I see some NaNs in the network well before the mentioned Softmax layer.
The network input values range from 0 to 1. For such a network, does that mean FP16 precision inference is definitely not feasible? Do you have any suggestions? @galagam
FP16 accuracy is best when processing normalized data. You can read a bit about how large magnitude affects the accuracy in the TensorRT developer guide.
As a general rule, if you can run a network successfully in FP16 using ONNX-RT, TensorRT will be able to do the same using TensorRT strongly-typed mode. To do that, you'll need to add explicit cast nodes to/from FP16 around the subgraphs that should be computed in FP16.
You can enable FP16 compute for subgraphs that process large magnitude data by scaling down the data (e.g. multiply by some scalar), to ensure the magnitude remains small.
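To make the first suggestion concrete: once the Cast nodes have been inserted into the ONNX graph (model_fp16_casts.onnx is a placeholder name for the edited model), the strongly-typed build can be requested from trtexec directly, e.g.:
trtexec --onnx=model_fp16_casts.onnx --stronglyTyped --saveEngine=test.trtmodel
In strongly-typed mode TensorRT follows the data types in the ONNX graph rather than choosing precisions itself, so only the regions bounded by the Cast nodes run in FP16.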
Description
The output of the TensorRT 10 model converted from ONNX is incorrect, while the output of the TensorRT 8.6 model is correct. The issue seems to be located in some fully connected layers in the TensorRT 10 model, where the error in the output suddenly becomes very large. The exact cause is unknown. Please help to resolve this issue.
Environment
TensorRT Version: TensorRT 10.0.1
NVIDIA GPU: Tesla T4
NVIDIA Driver Version: 450.36.06
CUDA Version: 11.0
CUDNN Version: 8.0.0
Operating System: CentOS Linux release 7.8.2003 (Core)
ONNX opset: 17
Relevant Files
Model link: https://drive.google.com/file/d/1QBbmtdaecWAHzqMdh10QVbdSjTWzleqo/view?usp=sharing
Steps To Reproduce