PINTO0309 / onnx2tf

Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf). I don't need a Star, but give me a pull request.
MIT License

The results from the converted model don't match the original ONNX version. #656

Closed. PetiteFleurPF closed this issue 2 months ago.

PetiteFleurPF commented 2 months ago

Issue Type

Others

OS

Linux

onnx2tf version number

1.23.0

onnx version number

1.15.0

onnxruntime version number

1.17.1

onnxsim (onnx_simplifier) version number

0.4.33

tensorflow version number

2.16.1

Download URL for ONNX without post-processing

https://we.tl/t-Z6RrkZRZxf (TFLITE version) https://we.tl/t-PH8hLCyGu9 (ONNX version)

Parameter Replacement JSON

*

Description

  1. Purpose: product development. Impact: obtaining a TFLite version could improve inference time.

  2. The results from the TFLite version of our model look like random bounding box coordinates (both in size and location) when plotted. [image] (A comparison sketch follows this list.)

  3. I investigated the differences between the two results; they appear before the post-processing part. One of the things I observed was node deletion during conversion: some nodes disappear during the conversion process: [image] I tested every option, including the one that disables simplification, and none of them was the solution. It is very surprising, because those nodes do exist at some point: they appear during the conversion steps and during validation: [image] [image]

    1. Goal: to successfully convert the ONNX model so that its outputs match the original.

    2. Nothing more is needed.

    PS: the previous problem was solved by deleting only the NMS part rather than the whole post-processing, but you helped us a lot in finding the correct way to do things. So thank you.
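A rough sketch (not from the original issue) of how this raw-output comparison can be reproduced in Python, assuming hypothetical file names, a 320x320 input, and that both runtimes list their outputs in the same order:

import numpy as np
import onnxruntime as ort
import tensorflow as tf

# Feed the exact same data to both runtimes (NCHW for ONNX, NHWC for TFLite)
# and compare the raw outputs before any post-processing. Paths and the input
# size are placeholders.
x_nchw = np.random.rand(1, 3, 320, 320).astype(np.float32)
x_nhwc = x_nchw.transpose(0, 2, 3, 1)

sess = ort.InferenceSession("model.onnx")
onnx_outs = sess.run(None, {sess.get_inputs()[0].name: x_nchw})

interp = tf.lite.Interpreter(model_path="saved_model/model_float32.tflite")
interp.allocate_tensors()
interp.set_tensor(interp.get_input_details()[0]["index"], x_nhwc)
interp.invoke()
tfl_outs = [interp.get_tensor(d["index"]) for d in interp.get_output_details()]

# Output order is not guaranteed to match between the two runtimes, so pair
# the tensors by shape or name before trusting this loop.
for o, t in zip(onnx_outs, tfl_outs):
    print(o.shape, t.shape, np.allclose(o, t, atol=1e-4))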

PINTO0309 commented 2 months ago

I don't see the correlation between the attached ONNX file and the image you posted.

[image]

PetiteFleurPF commented 2 months ago

I can't send you the final model; the one I shared with you is similar and reproduces the same error. All the images I've pasted are from the final model.

PetiteFleurPF commented 2 months ago

For example, after running the model I shared with you through the pipeline [image] I obtained this: [image] and this: [image] So you can reproduce the error.

PINTO0309 commented 2 months ago

I still don't understand what your concern is. The ONNX DepthwiseConv2d -> Clip -> Conv2d -> Reshape -> Transpose -> Reshape section you illustrate is a rather redundant and useless combination of multiple OPs. I have no idea what the problem is with changing the shape of [1,24,1,1] to [1,6,4]. Your ONNX file contains far too much useless processing.

This tool automatically eliminates all unnecessary processing.
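A minimal NumPy illustration (the actual permutation in the model is unknown, so the 4x6 split below is only an assumed example) of why such a Reshape -> Transpose -> Reshape chain is mere shape bookkeeping around the [1,24,1,1] to [1,6,4] change, which a converter can fold or rewrite:

import numpy as np

x = np.arange(24, dtype=np.float32).reshape(1, 24, 1, 1)

y = x.reshape(1, 4, 6, 1)       # Reshape
y = y.transpose(0, 2, 1, 3)     # Transpose
y = y.reshape(1, 6, 4)          # Reshape
print(y.shape)                  # (1, 6, 4)

# A direct reshape reaches the same shape with a different element order,
# which is the only thing the Transpose in the chain contributes.
z = x.reshape(1, 6, 4)
print(np.array_equal(y, z))     # False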

PetiteFleurPF commented 2 months ago

OK. I'm in the process of removing the post-processing from the model so that we can start from the same observation: the bounding boxes produced by the converted model are purely random, as can be seen from their varied sizes and locations. In short, the converted version seems to have lost all the information obtained during training. I'll write again when the model is uploaded.

PetiteFleurPF commented 2 months ago

And I am trying to understand why. That is why I made a hypothesis about the modification of the nodes.

PINTO0309 commented 2 months ago

[image]

PINTO0309 commented 2 months ago
onnx2tf -i model.onnx \
-onimc /head/regression_head/Concat_12_output_0 /Softmax_output_0 \
-cotof

[image]

[image]

PINTO0309 commented 2 months ago

The myriad uses of NonZero in post-processing severely inhibit the normal model transformation behavior of onnx2tf because the output is non-deterministic.

[image]
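A plain-NumPy sketch (not taken from the model) of why a data-dependent op like NonZero is hard to convert with static shapes: its output shape changes with the input values, so it cannot be fixed at conversion time.

import numpy as np

scores_a = np.array([0.9, 0.1, 0.8, 0.05], dtype=np.float32)
scores_b = np.array([0.9, 0.7, 0.8, 0.65], dtype=np.float32)

# Same op, same threshold, different output shapes depending on the data.
print(np.nonzero(scores_a > 0.5)[0].shape)  # (2,)
print(np.nonzero(scores_b > 0.5)[0].shape)  # (4,)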

PINTO0309 commented 2 months ago

If you want to implement NMS, bounding box filtering with NonZero + TopK + If is quite redundant and quite wasteful for inference in TFLite.
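A hedged sketch of the alternative hinted at here: drop the in-graph NonZero/TopK/If filtering and run NMS in post-processing with TensorFlow's built-in op. The shapes follow the (1, 3234, 4) / (1, 3234, 91) outputs discussed later in this thread; the thresholds, the class-agnostic score reduction, and the corner box format are assumptions.

import tensorflow as tf

def postprocess(boxes, scores, score_thr=0.5, iou_thr=0.5, max_det=100):
    # boxes: (1, 3234, 4) in [y1, x1, y2, x2] order (assumed)
    # scores: (1, 3234, 91) per-class scores
    boxes = tf.reshape(boxes, [-1, 4])
    cls_scores = tf.reduce_max(scores[0], axis=-1)   # best class score per box
    keep = tf.image.non_max_suppression(
        boxes, cls_scores, max_output_size=max_det,
        iou_threshold=iou_thr, score_threshold=score_thr)
    return tf.gather(boxes, keep), tf.gather(cls_scores, keep)

# Example usage: kept_boxes, kept_scores = postprocess(tfl_boxes, tfl_scores)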

PetiteFleurPF commented 2 months ago

I still haven't updated the model ^^ Don't focus on the post-processing; my problem is not coming from there :) You will see. (I am deleting the post-processing now to recreate a model without it, so that you can do a correct investigation.) Give me 5 minutes.

onnx2tf -i model.onnx \
-onimc /head/regression_head/Concat_12_output_0 /Softmax_output_0 \
-cotof

[image]

[image] And here you reproduced what I said :) After that comes the post-processing, but at this point the output produced by the "concat" node is already incorrect.

PetiteFleurPF commented 2 months ago
onnx2tf -i model.onnx \
-onimc /head/regression_head/Concat_12_output_0 /Softmax_output_0 \
-cotof

[image]

[image]

OK, it seems you already did it. So just run a detection test and you will see that the results from the ONNX model and the TFLite model are absolutely not the same. And the TFLite results are made of random bounding boxes.

PetiteFleurPF commented 2 months ago

I just finished deleting the post-processing and extracted the data:

So you can see that the results are not the same.

PINTO0309 commented 2 months ago

[image]

Thus, I still don't understand why you claim that the output values are different, even though the comparison of all the elements shows they are identical. If you are really comparing outputs with all post-processing removed, then they must match.

This is because the -cotof option compares the ONNX output with the TensorFlow output with near-perfect precision for all elements one by one.

You always seem to paste the output as an image, but that doesn't give me any of the information I need. Does the shape of the output tensor in ONNX exactly match the shape of the output tensor in TensorFlow? If the channel positions do not match exactly, there is absolutely no point in comparing element-by-element values.
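A plain-NumPy sketch of that point: if ONNX emits NCHW and TFLite emits NHWC, transpose one side into the other layout and confirm the shapes match before comparing elements. The shapes below are made up; only the pattern matters.

import numpy as np

onnx_out = np.random.rand(1, 4, 32, 32).astype(np.float32)   # NCHW (placeholder)
tfl_out = np.random.rand(1, 32, 32, 4).astype(np.float32)    # NHWC (placeholder)

onnx_nhwc = onnx_out.transpose(0, 2, 3, 1)                   # NCHW -> NHWC
assert onnx_nhwc.shape == tfl_out.shape                      # shapes must match first
print(np.allclose(onnx_nhwc, tfl_out, atol=1e-4))            # element-wise check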

PetiteFleurPF commented 2 months ago

When I added my post-processing steps (without NMS) into the ONNX model before the conversion, the bounding boxes from ONNX made sense in terms of localization, but the ones from TFLite did not.

PINTO0309 commented 2 months ago

Let me organize this. I still don't understand the claim, so this will be my last reply.

  1. The logic of ONNX and the preprocessing of the TFLite inference process are exactly the same.
  2. No quantization is performed.
  3. Normalization must not be performed in preprocessing.
  4. Whenever all input data are set to 1, the inference results always match. (All 12,936 elements match; see the sketch after this comment.)
  5. The output of /head/regression_head/Concat_12_output_0 and /Softmax_output_0 all match at the element level.
  6. Your final bounding box result is meaningless.
  7. I'm not interested in the post-processing you wrote.
  8. I have no idea where the output [3234, 4] came from.
  9. This is not a problem with the model generated by onnx2tf. It's a difference problem in your pre-processing or post-processing.
INFO: onnx_output_name: /head/regression_head/Concat_12_output_0 tf_output_name: tf.concat/concat:0 shape: (1, 3234, 4) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/classification_head/Concat_12_output_0 tf_output_name: tf.concat_1/concat:0 shape: (1, 3234, 91) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /Softmax_output_0 tf_output_name: tf.nn.softmax//Softmax:0 shape: (1, 3234, 91) dtype: float32 validate_result:  Matches 

[image]
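The point-4 check can be reproduced with the comparison sketch shown earlier in the thread by swapping the random input for a constant one (sizes are still assumptions):

import numpy as np

# Constant all-ones inputs for both runtimes (NCHW for ONNX, NHWC for TFLite).
# If the models agree on this input, as the validation log above shows, any
# mismatch on real images comes from pre- or post-processing, not the graph.
x_nchw = np.ones((1, 3, 320, 320), dtype=np.float32)
x_nhwc = np.ones((1, 320, 320, 3), dtype=np.float32)
# Feed these to the ONNX session and the TFLite interpreter exactly as in the
# earlier sketch, then check np.allclose on each matched pair of outputs.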

PetiteFleurPF commented 2 months ago

The solution was the "Whenever all input data are set to 1" check. With other TFLite models that we used, that wasn't the case. I tested it and everything works perfectly now. Many, many thanks! :D