PINTO0309 / onnx2tf

Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf). I don't need a Star, but give me a pull request.
MIT License
666 stars 67 forks

[InSPyReNet] Swin Transformer Support question #312

Closed bernii closed 1 year ago

bernii commented 1 year ago

Issue Type

Others

onnx2tf version number

1.9.2

onnx version number

1.13.1

tensorflow version number

2.12.0

Download URL for ONNX

https://github.com/plemeri/InSPyReNet

Parameter Replacement JSON

None

Description

This is more of a general question - are there any success stories of using onnx2tf with Swin Transformers (for example https://github.com/plemeri/InSPyReNet is using SwinB)? I tried conversion and it passed without errors to saved_model format but inference gives me output that has nothing to do with the input.

Wondering if that's a known issue/untested territory, or maybe I have a problem on my side.

PINTO0309 commented 1 year ago

The -cotof option must be used to verify that the model has been converted correctly. The conversion from NCHW to NHWC cannot be 100% automatic and perfect, so it is necessary to check with the human eye whether the final generated model is degraded or not.
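Concretely, the `-cotof` check is an element-wise comparison of each op's ONNX and TF outputs with `np.allclose` under an absolute tolerance (the validation logs in this thread use `atol=0.0001`). A minimal sketch of that criterion, with made-up tensor values:

```python
import numpy as np

# Toy outputs standing in for one op's ONNX and TF results.
onnx_out = np.array([0.5, 1.0, 2.0], dtype=np.float32)
tf_out = np.array([0.5, 1.00005, 2.0], dtype=np.float32)  # tiny numeric drift

# Same condition as the -cotof log line: rtol=0.0, atol=0.0001, equal_nan=True.
matches = np.allclose(onnx_out, tf_out, rtol=0.0, atol=0.0001, equal_nan=True)
print("Matches" if matches else "Unmatched")  # → Matches
```

With `rtol=0.0`, only the absolute difference matters, so a per-op `max_abs_error` above 0.0001 is reported as Unmatched regardless of the magnitude of the values.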

See this issue: When Onnx Matmul inputs have different dimension #133

are there any success stories of using onnx2tf with Swin Transformers

Yes.

https://github.com/PINTO0309/onnx2tf#validated-models-without-replacementjson


If you share the ONNX file with me, I will look into it. If you can't, I won't look any further; the work of generating the ONNX myself would waste my private time.

bernii commented 1 year ago

Thank you for the response! and valuable pointers 😄

I've uploaded the file here: https://github.com/bernii/aut-1-5/releases/download/test/latest.opset17.onnx in case you want to take a look. I'd be happy to debug further myself, but I'm not sure what the procedure should be; there's a lot going on here.

I just re-ran the conversion with

onnx2tf -i latest.opset17.onnx -cotof -osd

and got the following result. It seems there was an Out Of Memory error during the verification process, as it was trying to allocate ~60 GB of RAM.

saved_model output started ==========================================================
saved_model output complete!
Float32 tflite output complete!
Float16 tflite output complete!

ONNX and TF output value validation started =========================================
INFO: validation_conditions: np.allclose(onnx_outputs, tf_outputs, rtol=0.0, atol=0.0001, equal_nan=True)
2023-04-15 13:52:43.712462 [W:onnxruntime:, graph.cc:107 MergeShapeInfo] Error merging shape info for output. '/model/ReduceMax_output_0' source:{1} target:{}. Falling back to lenient merge.
2023-04-15 13:52:43.712499 [W:onnxruntime:, graph.cc:107 MergeShapeInfo] Error merging shape info for output. '/model/ReduceMin_output_0' source:{1} target:{}. Falling back to lenient merge.
2023-04-15 13:52:43.747498 [W:onnxruntime:, graph.cc:107 MergeShapeInfo] Error merging shape info for output. '/model/Sub_1_output_0' source:{1} target:{}. Falling back to lenient merge.
2023-04-15 13:52:57.961792 [W:onnxruntime:, execution_frame.cc:828 VerifyOutputSizes] Expected shape from model of {} does not match actual shape of {1} for output /model/Add_3_output_0
[1]    29187 killed     onnx2tf -i latest.opset17.onnx -cotof -osd
PINTO0309 commented 1 year ago

and got the following result. It seems there was an Out Of Memory error during the verification process, as it was trying to allocate ~60 GB of RAM.


I see. The model seems to be too large for this. If time permits, I feel the path to success is closer if the model is split up and validated in as much detail as possible.

Using the -cotof option makes OOM more likely because it tries to keep the ONNX inference results of all OPs in memory; INT64 tensors, for example, are typically converted to Numpy data of large size. I will be doing other challenging tasks today, so I can't validate it right away, but if you are in a hurry, you can try splitting the model into smaller pieces yourself and validating the accuracy with the -cotof option.
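The OOM is plausible from arithmetic alone: retaining every intermediate ONNX output means hundreds of large activations held at once. A rough estimate for one of this model's `(1, 65536, 512)` float32 activations (the per-tensor sizes below are computed; the "hundreds of tensors" scale is an assumption from the log length):

```python
import numpy as np

shape = (1, 65536, 512)  # one MLP activation shape seen in the validation log
n_elems = int(np.prod(shape))

mib_fp32 = n_elems * 4 / 1024**2   # float32: 4 bytes per element
mib_int64 = n_elems * 8 / 1024**2  # int64 tensors take twice the space

# A few hundred retained tensors of this size already reach tens of GiB.
print(f"fp32: {mib_fp32:.0f} MiB, int64: {mib_int64:.0f} MiB")
```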

https://github.com/PINTO0309/sne4onnx


bernii commented 1 year ago

Not in a hurry but I'd love to help if possible :)

I split a part of the model with sne4onnx

sne4onnx --input_onnx_file_path latest.opset17.onnx \
--input_op_names input \
--output_op_names  /model/backbone/layers.0/blocks.1/Add_1_output_0 \
--output_onnx_file_path latest.opset17.head1.onnx

and it seems that things go sideways very early, as I see Unmatched on the 4th operation (LayerNormalization_output_0)

ONNX and TF output value validation started =========================================
INFO: validation_conditions: np.allclose(onnx_outputs, tf_outputs, rtol=0.0, atol=0.0001, equal_nan=True)
INFO: onnx_output_name: /model/backbone/patch_embed/proj/Conv_output_0 tf_output_name: tf.math.add/Add:0 shape: (1, 128, 256, 256) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /model/backbone/patch_embed/Reshape_output_0 tf_output_name: tf.reshape/Reshape:0 shape: (1, 128, 65536) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /model/backbone/patch_embed/Transpose_output_0 tf_output_name: tf.compat.v1.transpose_1/transpose:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /model/backbone/patch_embed/norm/LayerNormalization_output_0 tf_output_name: tf.math.add_2/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.7289307713508606
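A mismatch that first appears at a `LayerNormalization` op, while everything before it matches, is consistent with the normalization statistics being computed over the wrong axis after the NCHW-to-NHWC layout change. A toy sketch of that failure mode (not the converter's actual code):

```python
import numpy as np

def layer_norm(x, axis, eps=1e-5):
    # Normalize to zero mean / unit variance along `axis`.
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 16, 8)).astype(np.float32)  # (batch, tokens, channels)

correct = layer_norm(x, axis=-1)  # over channels, as ONNX LayerNormalization does
wrong = layer_norm(x, axis=1)     # over tokens: an axis mix-up after transposing

max_abs_error = float(np.abs(correct - wrong).max())
print(f"max_abs_error: {max_abs_error:.3f}")
```

The two results differ by order-of-magnitude-one values, matching the sub-1.0 to several-unit `max_abs_error` figures in the log rather than tiny floating-point drift.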

Full log

ONNX and TF output value validation started =========================================
INFO: validation_conditions: np.allclose(onnx_outputs, tf_outputs, rtol=0.0, atol=0.0001, equal_nan=True)
INFO: onnx_output_name: /model/backbone/patch_embed/proj/Conv_output_0 tf_output_name: tf.math.add/Add:0 shape: (1, 128, 256, 256) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /model/backbone/patch_embed/Reshape_output_0 tf_output_name: tf.reshape/Reshape:0 shape: (1, 128, 65536) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /model/backbone/patch_embed/Transpose_output_0 tf_output_name: tf.compat.v1.transpose_1/transpose:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /model/backbone/patch_embed/norm/LayerNormalization_output_0 tf_output_name: tf.math.add_2/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.7289307713508606
INFO: onnx_output_name: /model/backbone/patch_embed/Transpose_1_output_0 tf_output_name: tf.compat.v1.transpose_2/transpose:0 shape: (1, 128, 65536) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.7289307713508606
INFO: onnx_output_name: /model/backbone/patch_embed/Reshape_1_output_0 tf_output_name: tf.reshape_1/Reshape:0 shape: (1, 128, 256, 256) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.7289307713508606
INFO: onnx_output_name: /model/backbone/Reshape_output_0 tf_output_name: tf.reshape_2/Reshape:0 shape: (1, 128, 65536) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.7289307713508606
INFO: onnx_output_name: /model/backbone/Transpose_output_0 tf_output_name: tf.compat.v1.transpose_5/transpose:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.7289307713508606
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/norm1/LayerNormalization_output_0 tf_output_name: tf.math.add_4/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.36497163772583
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Reshape_output_0 tf_output_name: tf.reshape_3/Reshape:0 shape: (1, 256, 256, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.36497163772583
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Pad_output_0 tf_output_name: tf.compat.v1.pad//model/backbone/layers.0/blocks.0/Pad:0 shape: (1, 264, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.36497163772583
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Reshape_3_output_0 tf_output_name: tf.reshape_4/Reshape:0 shape: (1, 22, 12, 22, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.36497163772583
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Transpose_1_output_0 tf_output_name: tf.compat.v1.transpose_8/transpose:0 shape: (1, 22, 22, 12, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.36497163772583
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Reshape_4_output_0 tf_output_name: tf.reshape_5/Reshape:0 shape: (484, 12, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.36497163772583
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Reshape_5_output_0 tf_output_name: tf.reshape_6/Reshape:0 shape: (484, 144, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.36497163772583
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/qkv/MatMul_output_0 tf_output_name: tf.linalg.matmul/MatMul:0 shape: (484, 144, 384) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.9953839778900146
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/qkv/Add_output_0 tf_output_name: tf.math.add_5/Add:0 shape: (484, 144, 384) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.9953839778900146
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/Reshape_output_0 tf_output_name: tf.reshape_7/Reshape:0 shape: (484, 144, 3, 4, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.9953839778900146
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/Transpose_output_0 tf_output_name: tf.compat.v1.transpose_12/transpose:0 shape: (3, 484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.9953839778900146
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/Gather_3_output_0 tf_output_name: tf.compat.v1.gather/GatherV2:0 shape: (484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.9953839778900146
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/Gather_4_output_0 tf_output_name: tf.compat.v1.gather_1/GatherV2:0 shape: (484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.9548740386962891
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/Gather_5_output_0 tf_output_name: tf.compat.v1.gather_2/GatherV2:0 shape: (484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.971457839012146
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/Mul_output_0 tf_output_name: tf.math.multiply_8/Mul:0 shape: (484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.1759607195854187
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/Transpose_1_output_0 tf_output_name: tf.compat.v1.transpose_13/transpose:0 shape: (484, 4, 32, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.9548740386962891
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/MatMul_output_0 tf_output_name: tf.linalg.matmul_1/MatMul:0 shape: (484, 4, 144, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 7.500820159912109
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/Add_output_0 tf_output_name: tf.math.add_6/Add:0 shape: (484, 4, 144, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 7.500820159912109
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/softmax/Softmax_output_0 tf_output_name: tf.nn.softmax_1/Softmax:0 shape: (484, 4, 144, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.3356877565383911
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/MatMul_1_output_0 tf_output_name: tf.linalg.matmul_2/MatMul:0 shape: (484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.4929486513137817
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/Transpose_2_output_0 tf_output_name: tf.compat.v1.transpose_14/transpose:0 shape: (484, 144, 4, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.4929486513137817
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/Reshape_1_output_0 tf_output_name: tf.reshape_8/Reshape:0 shape: (484, 144, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.4929486513137817
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/proj/MatMul_output_0 tf_output_name: tf.linalg.matmul_3/MatMul:0 shape: (484, 144, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.8404282331466675
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/attn/proj/Add_output_0 tf_output_name: tf.math.add_7/Add:0 shape: (484, 144, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.8404282927513123
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Reshape_6_output_0 tf_output_name: tf.reshape_9/Reshape:0 shape: (484, 12, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.8404282927513123
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Reshape_7_output_0 tf_output_name: tf.reshape_10/Reshape:0 shape: (1, 22, 22, 12, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.8404282927513123
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Transpose_2_output_0 tf_output_name: tf.compat.v1.transpose_18/transpose:0 shape: (1, 22, 12, 22, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.8404282927513123
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Reshape_8_output_0 tf_output_name: tf.reshape_11/Reshape:0 shape: (1, 264, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.8404282927513123
INFO: onnx_output_name: 9741 tf_output_name: tf.strided_slice_1/StridedSlice:0 shape: (1, 256, 256, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.8404282927513123
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Reshape_9_output_0 tf_output_name: tf.reshape_12/Reshape:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.8404282927513123
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Add_output_0 tf_output_name: tf.math.add_8/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.8607869744300842
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/norm2/LayerNormalization_output_0 tf_output_name: tf.math.add_10/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 6.115452289581299
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/mlp/fc1/MatMul_output_0 tf_output_name: tf.linalg.matmul_4/MatMul:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 3.9582643508911133
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/mlp/fc1/Add_output_0 tf_output_name: tf.math.add_11/Add:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 3.9582643508911133
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/mlp/act/Div_output_0 tf_output_name: tf.math.divide/truediv:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 2.7989156246185303
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/mlp/act/Erf_output_0 tf_output_name: tf.math.erf/Erf:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.8526116609573364
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/mlp/act/Add_output_0 tf_output_name: tf.math.add_12/Add:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.8526116609573364
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/mlp/act/Mul_output_0 tf_output_name: tf.math.multiply_19/Mul:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 7.8544087409973145
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/mlp/act/Mul_1_output_0 tf_output_name: tf.math.multiply_21/Mul:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 3.9272043704986572
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/mlp/fc2/MatMul_output_0 tf_output_name: tf.linalg.matmul_5/MatMul:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 2.1741857528686523
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/mlp/fc2/Add_output_0 tf_output_name: tf.math.add_13/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 2.1741855144500732
INFO: onnx_output_name: /model/backbone/layers.0/blocks.0/Add_1_output_0 tf_output_name: tf.math.add_14/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 3.034972667694092
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/norm1/LayerNormalization_output_0 tf_output_name: tf.math.add_16/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Reshape_output_0 tf_output_name: tf.reshape_13/Reshape:0 shape: (1, 256, 256, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Pad_output_0 tf_output_name: tf.compat.v1.pad_1//model/backbone/layers.0/blocks.1/Pad:0 shape: (1, 264, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Slice_1_output_0 tf_output_name: tf.strided_slice_3/StridedSlice:0 shape: (1, 258, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Slice_2_output_0 tf_output_name: tf.strided_slice_5/StridedSlice:0 shape: (1, 6, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 2.4388442039489746
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Concat_3_output_0 tf_output_name: tf.concat/concat:0 shape: (1, 264, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Slice_3_output_0 tf_output_name: tf.strided_slice_7/StridedSlice:0 shape: (1, 264, 258, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Slice_4_output_0 tf_output_name: tf.strided_slice_9/StridedSlice:0 shape: (1, 264, 6, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.472478985786438
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Concat_4_output_0 tf_output_name: tf.concat_1/concat:0 shape: (1, 264, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Reshape_3_output_0 tf_output_name: tf.reshape_14/Reshape:0 shape: (1, 22, 12, 22, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Transpose_1_output_0 tf_output_name: tf.compat.v1.transpose_23/transpose:0 shape: (1, 22, 22, 12, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Reshape_4_output_0 tf_output_name: tf.reshape_15/Reshape:0 shape: (484, 12, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Reshape_5_output_0 tf_output_name: tf.reshape_16/Reshape:0 shape: (484, 144, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.020336151123047
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/qkv/MatMul_output_0 tf_output_name: tf.linalg.matmul_6/MatMul:0 shape: (484, 144, 384) dtype: float32 validate_result:  Unmatched  max_abs_error: 6.1924729347229
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/qkv/Add_output_0 tf_output_name: tf.math.add_17/Add:0 shape: (484, 144, 384) dtype: float32 validate_result:  Unmatched  max_abs_error: 6.192473411560059
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Reshape_output_0 tf_output_name: tf.reshape_17/Reshape:0 shape: (484, 144, 3, 4, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 6.192473411560059
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Transpose_output_0 tf_output_name: tf.compat.v1.transpose_27/transpose:0 shape: (3, 484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 6.192473411560059
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Gather_3_output_0 tf_output_name: tf.compat.v1.gather_3/GatherV2:0 shape: (484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.603039741516113
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Gather_4_output_0 tf_output_name: tf.compat.v1.gather_4/GatherV2:0 shape: (484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 6.192473411560059
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Gather_5_output_0 tf_output_name: tf.compat.v1.gather_5/GatherV2:0 shape: (484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.913492679595947
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Mul_output_0 tf_output_name: tf.math.multiply_31/Mul:0 shape: (484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.81371009349823
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Transpose_1_output_0 tf_output_name: tf.compat.v1.transpose_28/transpose:0 shape: (484, 4, 32, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 6.192473411560059
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/MatMul_output_0 tf_output_name: tf.linalg.matmul_7/MatMul:0 shape: (484, 4, 144, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 31.618335723876953
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Add_output_0 tf_output_name: tf.math.add_18/Add:0 shape: (484, 4, 144, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 31.61833381652832
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Reshape_1_output_0 tf_output_name: tf.reshape_18/Reshape:0 shape: (1, 484, 4, 144, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 31.61833381652832
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Add_1_output_0 tf_output_name: tf.math.add_19/Add:0 shape: (1, 484, 4, 144, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 31.61833381652832
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Reshape_2_output_0 tf_output_name: tf.reshape_19/Reshape:0 shape: (484, 4, 144, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 31.61833381652832
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/softmax/Softmax_output_0 tf_output_name: tf.nn.softmax_6/Softmax:0 shape: (484, 4, 144, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.9486146569252014
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/MatMul_1_output_0 tf_output_name: tf.linalg.matmul_8/MatMul:0 shape: (484, 4, 144, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 7.560842514038086
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Transpose_2_output_0 tf_output_name: tf.compat.v1.transpose_31/transpose:0 shape: (484, 144, 4, 32) dtype: float32 validate_result:  Unmatched  max_abs_error: 7.560842514038086
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/Reshape_3_output_0 tf_output_name: tf.reshape_20/Reshape:0 shape: (484, 144, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 7.560842514038086
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/proj/MatMul_output_0 tf_output_name: tf.linalg.matmul_9/MatMul:0 shape: (484, 144, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/attn/proj/Add_output_0 tf_output_name: tf.math.add_20/Add:0 shape: (484, 144, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Reshape_6_output_0 tf_output_name: tf.reshape_21/Reshape:0 shape: (484, 12, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Reshape_7_output_0 tf_output_name: tf.reshape_22/Reshape:0 shape: (1, 22, 22, 12, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Transpose_2_output_0 tf_output_name: tf.compat.v1.transpose_35/transpose:0 shape: (1, 22, 12, 22, 12, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Reshape_8_output_0 tf_output_name: tf.reshape_23/Reshape:0 shape: (1, 264, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Slice_5_output_0 tf_output_name: tf.strided_slice_11/StridedSlice:0 shape: (1, 6, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.26747727394104
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Slice_6_output_0 tf_output_name: tf.strided_slice_13/StridedSlice:0 shape: (1, 258, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Concat_9_output_0 tf_output_name: tf.concat_2/concat:0 shape: (1, 264, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Slice_7_output_0 tf_output_name: tf.strided_slice_15/StridedSlice:0 shape: (1, 264, 6, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.6969082355499268
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Slice_8_output_0 tf_output_name: tf.strided_slice_17/StridedSlice:0 shape: (1, 264, 258, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Concat_10_output_0 tf_output_name: tf.concat_3/concat:0 shape: (1, 264, 264, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: 9746 tf_output_name: tf.strided_slice_19/StridedSlice:0 shape: (1, 256, 256, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Reshape_9_output_0 tf_output_name: tf.reshape_24/Reshape:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.7892866134643555
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Add_output_0 tf_output_name: tf.math.add_21/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 2.767470359802246
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/norm2/LayerNormalization_output_0 tf_output_name: tf.math.add_23/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 11.945793151855469
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/mlp/fc1/MatMul_output_0 tf_output_name: tf.linalg.matmul_10/MatMul:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 6.830749988555908
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/mlp/fc1/Add_output_0 tf_output_name: tf.math.add_24/Add:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 6.830749988555908
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/mlp/act/Div_output_0 tf_output_name: tf.math.divide_1/truediv:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.8300700187683105
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/mlp/act/Erf_output_0 tf_output_name: tf.math.erf_1/Erf:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.9050416946411133
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/mlp/act/Add_output_0 tf_output_name: tf.math.add_25/Add:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 1.9050416946411133
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/mlp/act/Mul_output_0 tf_output_name: tf.math.multiply_42/Mul:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 6.467710018157959
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/mlp/act/Mul_1_output_0 tf_output_name: tf.math.multiply_44/Mul:0 shape: (1, 65536, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 3.2338550090789795
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/mlp/fc2/MatMul_output_0 tf_output_name: tf.linalg.matmul_11/MatMul:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 3.2158801555633545
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/mlp/fc2/Add_output_0 tf_output_name: tf.math.add_26/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 3.2158801555633545
INFO: onnx_output_name: /model/backbone/layers.0/blocks.1/Add_1_output_0 tf_output_name: tf.math.add_27/Add:0 shape: (1, 65536, 128) dtype: float32 validate_result:  Unmatched  max_abs_error: 4.008249282836914
PINTO0309 commented 1 year ago

Thank you. It's a LayerNormalization conversion bug. Thanks to you I was able to find the problem in my logic. :+1:

Once the CI is all green, I will release the revised package in about an hour.


bernii commented 1 year ago

Also, I found one more thing: my ONNX model had already been run through onnxsim/simplify. If I skip that step and feed an extracted part of the non-simplified model to onnx2tf, I get the following error

$ sne4onnx --input_onnx_file_path ../InSPyReNet/latest.opset17.no-simplify.onnx --input_op_names input --output_op_names /model/backbone/patch_embed/Transpose_1_output_0 /model/backbone/layers.0/blocks.1/Add_output_0 --output_onnx_file_path latest.opset17.no-simplify.head1.onnx
$ onnx2tf -i latest.opset17.no-simplify.head1.onnx -cotof
...
INFO:  input_name.2: model.backbone.patch_embed.norm.weight shape: [128] dtype: <class 'numpy.float32'>
INFO:  input_name.3: model.backbone.patch_embed.norm.bias shape: [128] dtype: <class 'numpy.float32'>
INFO:  output_name.1: /model/backbone/patch_embed/norm/LayerNormalization_output_0 shape: None dtype: float32
ERROR: The trace log is below.
Traceback (most recent call last):
  File "/Users/berni/git/bkg-rm-prep/onnx2tf/onnx2tf/utils/common_functions.py", line 281, in print_wrapper_func
    result = func(*args, **kwargs)
  File "/Users/berni/git/bkg-rm-prep/onnx2tf/onnx2tf/utils/common_functions.py", line 359, in inverted_operation_enable_disable_wrapper_func
    result = func(*args, **kwargs)
  File "/Users/berni/git/bkg-rm-prep/onnx2tf/onnx2tf/utils/common_functions.py", line 50, in get_replacement_parameter_wrapper_func
    func(*args, **kwargs)
  File "/Users/berni/git/bkg-rm-prep/onnx2tf/onnx2tf/ops/LayerNormalization.py", line 77, in make_node
    scale_shape_idx = list(input_tensor_shape).index(values.shape[0])
ValueError: 128 is not in list
ERROR: input_onnx_file_path: latest.opset17.head1.onnx
ERROR: onnx_op_name: /model/backbone/patch_embed/norm/LayerNormalization
ERROR: Read this and deal with it. https://github.com/PINTO0309/onnx2tf#parameter-replacement
ERROR: Alternatively, if the input OP has a dynamic dimension, use the -b or -ois option to rewrite it to a static shape and try again.
ERROR: If the input OP of ONNX before conversion is NHWC or an irregular channel arrangement other than NCHW, use the -kt or -kat option.
ERROR: Also, for models that include NonMaxSuppression in the post-processing, try the -onwdt option.

Is it OK if I simplify ONNX in my own pipeline, or should I not do that when using onnx2tf?

Uploaded the onnx no-simplify model in case you need it: https://github.com/bernii/aut-1-5/releases/download/test/latest.opset17.no-simplify.onnx

PINTO0309 commented 1 year ago

Is it OK if I simplify ONNX in my own pipeline, or should I not do that when using onnx2tf?

Either is fine. However, there is an important technique.

For huge onnx models, run onnxsim at least 5 times.
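The "run onnxsim at least 5 times" advice amounts to iterating the simplifier to a fixed point: each pass may expose folding opportunities for the next one. A minimal sketch of that pattern, kept abstract so it runs anywhere; `simplify_until_stable` and `toy_simplify` are illustrative names, not part of onnx2tf or onnxsim (with onnxsim you would pass its real `simplify` step in place of the toy one):

```python
def simplify_until_stable(model, simplify, max_passes=5):
    """Apply `simplify` repeatedly until the model stops changing
    or `max_passes` is reached. For huge ONNX models, a single pass
    is often not enough, hence the "run it at least 5 times" advice."""
    for _ in range(max_passes):
        simplified = simplify(model)
        if simplified == model:  # reached a fixed point: done
            break
        model = simplified
    return model

def toy_simplify(tokens):
    """Toy simplifier: removes one adjacent duplicate pair per pass,
    mimicking a simplifier that needs several passes to converge."""
    tokens = list(tokens)
    for i in range(len(tokens) - 1):
        if tokens[i] == tokens[i + 1]:
            del tokens[i]
            break
    return tuple(tokens)

print(simplify_until_stable(("a", "a", "b", "b", "a"), toy_simplify))
# → ('a', 'b', 'a') after two passes plus one no-change pass
```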

PINTO0309 commented 1 year ago

Fixes LayerNormalization: https://github.com/PINTO0309/onnx2tf/releases/tag/1.9.3

bernii commented 1 year ago

🙇 I can see that latest.opset17.head1.onnx definitely passes the check now

I made another extraction but I'm getting an exception - not sure if I'm messing up the model with sne4onnx or if that's expected 🤔

$sne4onnx --input_onnx_file_path latest.opset17.onnx --input_op_names  /model/backbone/Reshape_4_output_0 /model/backbone/Reshape_3_output_0 /model/backbone/Reshape_2_output_0 /model/backbone/Reshape_1_output_0 /model/context1/Add_output_0  --output_op_names output --output_onnx_file_path latest.opset17.head2.onnx
$python onnx2tf/onnx2tf.py -i latest.opset17.head2.onnx -cotof
...
INFO: onnx_op_type: Relu onnx_op_name: /model/attention2/conv_out3/relu/Relu
INFO:  input_name.1: /model/attention2/conv_out3/conv/Conv_output_0 shape: [1, 64, 256, 256] dtype: float32
INFO:  output_name.1: /model/attention2/conv_out3/relu/Relu_output_0 shape: [1, 64, 256, 256] dtype: float32
INFO: tf_op_type: relu
INFO:  input.1.features: name: tf.math.add_193/Add:0 shape: (1, 256, 256, 64) dtype: <dtype: 'float32'> 
INFO:  output.1.output: name: tf.nn.relu_13/Relu:0 shape: (1, 256, 256, 64) dtype: <dtype: 'float32'> 

INFO: onnx_op_type: Conv onnx_op_name: /model/attention2/conv_out4/conv/Conv
INFO:  input_name.1: /model/attention2/conv_out3/relu/Relu_output_0 shape: [1, 64, 256, 256] dtype: float32
INFO:  input_name.2: onnx::Conv_14478 shape: [1, 64, 1, 1] dtype: <class 'numpy.float32'>
INFO:  input_name.3: onnx::Conv_14479 shape: [1] dtype: <class 'numpy.float32'>
INFO:  output_name.1: /model/attention2/conv_out4/conv/Conv_output_0 shape: [1, 1, 256, 256] dtype: float32
INFO: tf_op_type: convolution_v2
INFO:  input.1.input: name: tf.nn.relu_13/Relu:0 shape: (1, 256, 256, 64) dtype: <dtype: 'float32'> 
INFO:  input.2.weights: shape: (1, 1, 64, 1) dtype: <dtype: 'float32'> 
INFO:  input.3.bias: shape: (1,) dtype: <dtype: 'float32'> 
INFO:  input.4.strides: val: [1, 1] 
INFO:  input.5.dilations: val: [1, 1] 
INFO:  input.6.padding: val: SAME 
INFO:  input.7.group: val: 1 
INFO:  output.1.output: name: tf.math.add_194/Add:0 shape: (1, 256, 256, 1) dtype: <dtype: 'float32'> 

INFO: onnx_op_type: Resize onnx_op_name: /model/Resize_3
INFO:  input_name.1: /model/attention2/conv_out3/relu/Relu_output_0 shape: [1, 64, 256, 256] dtype: float32
INFO:  input_name.2:  shape: None dtype: None
INFO:  input_name.3:  shape: None dtype: None
INFO:  input_name.4: onnx::Conv_14478 shape: (1, 64, 1, 1) dtype: <class 'numpy.float32'>
INFO:  output_name.1: /model/Resize_3_output_0 shape: [1, 64, 512, 512] dtype: float32
ERROR: The trace log is below.
Traceback (most recent call last):
  File "/onnx2tf/onnx2tf/utils/common_functions.py", line 281, in print_wrapper_func
    result = func(*args, **kwargs)
  File "/onnx2tf/onnx2tf/utils/common_functions.py", line 359, in inverted_operation_enable_disable_wrapper_func
    result = func(*args, **kwargs)
  File "/onnx2tf/onnx2tf/utils/common_functions.py", line 50, in get_replacement_parameter_wrapper_func
    func(*args, **kwargs)
  File "/onnx2tf/onnx2tf/ops/Resize.py", line 507, in make_node
    resized_tensor = Lambda(
  File "/onnx2tf/venv/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/onnx2tf/onnx2tf/utils/common_functions.py", line 980, in upsampling2d_bilinear
    return tf.compat.v1.image.resize_bilinear(
ValueError: Exception encountered when calling layer "lambda_7" (type Lambda).

Shape must be rank 1 but is rank 4 for '{{node lambda_7//model/Resize_3}} = ResizeBilinear[T=DT_FLOAT, align_corners=false, half_pixel_centers=true](Placeholder, lambda_7//model/Resize_3/size)' with input shapes: [1,256,256,64], [0,1,1,64].

Call arguments received by layer "lambda_7" (type Lambda):
  • inputs=tf.Tensor(shape=(1, 256, 256, 64), dtype=float32)
  • mask=None
  • training=None
ERROR: input_onnx_file_path: latest.opset17.head2.onnx
ERROR: onnx_op_name: /model/Resize_3
ERROR: Read this and deal with it. https://github.com/PINTO0309/onnx2tf#parameter-replacement
ERROR: Alternatively, if the input OP has a dynamic dimension, use the -b or -ois option to rewrite it to a static shape and try again.
ERROR: If the input OP of ONNX before conversion is NHWC or an irregular channel arrangement other than NCHW, use the -kt or -kat option.
ERROR: Also, for models that include NonMaxSuppression in the post-processing, try the -onwdt option.
PINTO0309 commented 1 year ago

I debugged it using the commands you provided and strangely enough the conversion succeeded without error. Although there is a transposition error somewhere in the OP, it does not appear to be a conversion error. I just copied and pasted your command, so I don't know where the possible problem is, except that the versions of onnxsim may not match.

I am surprised that the INFO: input_name.4: /model/Concat_6_output_0 shape: (4,) dtype: <class 'numpy.int64'> part does not match your log.

PINTO0309 commented 1 year ago

Incidentally, the [1, 256, 256] input of the MatMul here is mis-transposed by the tool, so you may need to do some parameter replacement. If a tensor is three-dimensional and all dimensions except the batch size are the same size, the tool can be confused by the transposition operations.
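The ambiguity is easy to see with NumPy: when a 3-D tensor has equal non-batch dimensions, every candidate perm produces the same shape, so shape matching alone cannot pick the right one even though the values differ (a toy illustration, not onnx2tf internals):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4, 4))  # stand-in for a [1, 256, 256] tensor

identity = x.transpose(0, 1, 2)     # leave axes as-is
swapped = x.transpose(0, 2, 1)      # the perm forced via replace.json

# Both candidates produce the same shape, so the converter cannot tell
# them apart from shape information alone...
print(identity.shape, swapped.shape)  # (1, 4, 4) (1, 4, 4)
# ...yet the values differ, so guessing the wrong perm silently
# corrupts the MatMul input.
print(np.allclose(identity, swapped))  # → False
```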

image

image

replace.json

```json
{
  "format_version": 1,
  "operations": [
    {"op_name": "/model/context2/branch1/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context2/branch1/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context2/branch2/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context2/branch2/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/decoder/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/decoder/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context2/branch3/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context2/branch3/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]}
  ]
}
```

image

bernii commented 1 year ago

I checked what causes the error: it's onnxsim. I was running 0.4.19, and after downgrading to 0.4.17 I'm no longer getting the exception during conversion. Thank you for sharing your package versions 👍

The rest of my dependencies

PINTO0309 commented 1 year ago

What? It never occurred to me that there could be a problem with onnxsim.

Thanks to you I have identified two serious problems today. I will make one more correction and release it. :smile_cat:

bernii commented 1 year ago

I'm continuing the work taking your replace.json as a starting point but struggling with

/model/context2/branch1/Hattn/MatMul_output_0

The input dimensions of the Transpose preceding it are not the same, and yet it's Unmatched - is there a technique to fix that part too?

Screenshot 2023-04-15 at 19 15 07

Screenshot 2023-04-15 at 19 15 56

PINTO0309 commented 1 year ago

Why the accuracy check flags errors on a plain MatMul is not known at this time; there may be a problem with the accuracy-verification logic itself. For now, I corrected all MatMuls that had unacceptably large errors, and I have tried to correct the errors in the other areas as well, apart from the ones you noticed.

replace_InSPyReNet.json

```json
{
  "format_version": 1,
  "operations": [
    {"op_name": "/model/context2/branch1/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context2/branch1/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context2/branch2/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context2/branch2/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/decoder/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/decoder/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context2/branch3/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context2/branch3/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context3/branch1/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context3/branch1/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context3/branch2/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context3/branch2/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context3/branch3/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context3/branch3/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context4/branch1/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context4/branch1/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context4/branch2/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context4/branch2/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context4/branch3/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context4/branch3/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context5/branch1/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context5/branch1/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context5/branch2/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context5/branch2/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context5/branch3/Hattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]},
    {"op_name": "/model/context5/branch3/Wattn/Transpose_1", "param_target": "attributes", "param_name": "perm", "values": [0,2,1]}
  ]
}
```

The final outputs now match. Perhaps the fact that the error is flattened by Softmax in the attention module has an effect, but I don't know. Comparing in Netron the weights of the ONNX MatMul OP that shows errors against the converted TFLite MatMul OP, there appears to be no problem.
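The "error flattened by Softmax" hypothesis is easy to check numerically: Softmax compresses its inputs into (0, 1), so a sizeable pre-Softmax discrepancy can shrink to a much smaller post-Softmax one (a NumPy sketch, illustrative only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
perturbed = logits + np.array([0.05, -0.05, 0.0])  # inject a pre-softmax error

pre_err = np.max(np.abs(perturbed - logits))
post_err = np.max(np.abs(softmax(perturbed) - softmax(logits)))
# The post-softmax error is noticeably smaller than the pre-softmax one,
# so an upstream MatMul discrepancy can be masked in the final outputs.
print(pre_err, post_err)
```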

bernii commented 1 year ago

thank you for the explanation :) I will make a couple more extractions with sne4onnx later today and check other parts of the model, as it seems that running the full model still gives pretty random results

PINTO0309 commented 1 year ago

I would like to implement the logic to determine the correct transposition of Transformer's MatMul input someday. I can't come up with a good idea, though. :sweat:

So far I myself find Transformer conversions to be quite cumbersome.

bernii commented 1 year ago

I don't know enough about the most common problems, but in the case of transpose it seems an auto-fix mode could be implemented that tries changing the order of the axes (if they're of the same size) and checks whether the error is gone - this way it could auto-generate replace.json. This could be very useful, especially for big models (like this one :) ).
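The auto-fix idea can be prototyped in a few lines: enumerate the perms that preserve the tensor's shape, run each candidate through the op, and keep the perm whose output best matches the ONNX reference. A toy NumPy sketch under that assumption (`best_perm` and both callbacks are hypothetical names; onnx2tf would substitute real per-op inference here):

```python
import itertools
import numpy as np

def best_perm(tensor, reference_fn, candidate_fn):
    """Try every axis permutation of `tensor` whose result keeps the same
    shape (the ambiguous cases), and return the (perm, error) pair that
    minimizes max-abs error against `reference_fn` -- a stand-in for
    running the original ONNX op on the correct input."""
    ref = reference_fn(tensor)
    best = None
    for perm in itertools.permutations(range(tensor.ndim)):
        candidate = tensor.transpose(perm)
        if candidate.shape != tensor.shape:
            continue  # shape-changing perms are not ambiguous
        err = np.max(np.abs(candidate_fn(candidate) - ref))
        if best is None or err < best[1]:
            best = (perm, err)
    return best

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4, 4))
w = rng.standard_normal((4, 4))
# The "converted" op accidentally received x transposed; the search
# should recover perm (0, 2, 1), which undoes the accidental transpose.
perm, err = best_perm(x.transpose(0, 2, 1),
                      reference_fn=lambda _: x @ w,
                      candidate_fn=lambda t: t @ w)
print(perm, err)
```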

When it comes to replacement I'm struggling a bit with my new extraction

sne4onnx --input_onnx_file_path latest.opset17.onnx --input_op_names /model/backbone/layers.0/blocks.1/Add_output_0 /model/backbone/layers.0/blocks.1/mlp/fc2/Add_output_0 --output_op_names /model/backbone/Transpose_2_output_0 /model/backbone/Transpose_3_output_0 /model/backbone/Transpose_4_output_0 /model/backbone/Transpose_1_output_0 --output_onnx_file_path latest.opset17.head3.onnx

I'm trying to address the bigger errors, like /model/backbone/layers.3/blocks.0/Reshape_4_output_0 - I've tried a couple of different combinations, but it seems they're not fully helping. For example:

, {
      "op_name": "/model/backbone/layers.2/downsample/reduction/MatMul",
      "param_target": "outputs",
      "param_name": "/model/backbone/layers.2/downsample/reduction/MatMul_output_0",
      "post_process_transpose_perm": [0,2,1]
    }    

I'm wondering if there might be a problem with layer normalization, though; while I was able to reduce the error from 37.8 to 8.1 with the operation above, the error still seems big?

Screenshot 2023-04-17 at 00 54 35

Screenshot 2023-04-17 at 00 37 22

PINTO0309 commented 1 year ago

This model is crazy big. :sweat_smile: If you succeed in converting this model, I feel you are a hero.

I'll take a look at a few when I have time.

I don't know enough about the most common problems, but in the case of transpose it seems an auto-fix mode could be implemented that tries changing the order of the axes (if they're of the same size) and checks whether the error is gone - this way it could auto-generate replace.json. This could be very useful, especially for big models (like this one :) ).

I see. Might be a good idea. Thanks. I'll try various things.

PINTO0309 commented 1 year ago

So far, I have not been able to figure out what causes this meaningless Transpose to be extrapolated by TensorFlow on its own.

https://github.com/PINTO0309/onnx2tf/blob/291671019e1a9e99ea59e0b477add33ede378c99/onnx2tf/ops/LayerNormalization.py#L147-L173

image

bernii commented 1 year ago

If I see things correctly, the origin of the transpose seems to be related to the bias variable (used in the Add operation), which is defined here https://github.com/PINTO0309/onnx2tf/blob/291671019e1a9e99ea59e0b477add33ede378c99/onnx2tf/ops/LayerNormalization.py#L101-L102 and in some cases the value of the bias is being transposed https://github.com/PINTO0309/onnx2tf/blob/291671019e1a9e99ea59e0b477add33ede378c99/onnx2tf/ops/LayerNormalization.py#L90
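The underlying constraint is that LayerNormalization's scale/bias vectors broadcast along the normalized (channel) axis, so when the data layout moves from NCHW to NHWC the vectors must follow the channel axis; getting that wrong produces exactly this kind of silent error. A minimal NumPy layer norm to illustrate (not onnx2tf's implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, axis, eps=1e-5):
    # Normalize over `axis`, then scale and shift; gamma/beta are
    # reshaped so they broadcast along that axis only.
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    shape = [1] * x.ndim
    shape[axis] = x.shape[axis]
    return (x - mean) / np.sqrt(var + eps) * gamma.reshape(shape) + beta.reshape(shape)

rng = np.random.default_rng(0)
nchw = rng.standard_normal((1, 3, 2, 2))  # channels at axis 1
nhwc = nchw.transpose(0, 2, 3, 1)         # channels at axis 3
gamma, beta = rng.standard_normal(3), rng.standard_normal(3)

# Normalizing the channel axis in either layout gives the same values
# once transposed back -- provided gamma/beta track the channel axis.
a = layer_norm(nchw, gamma, beta, axis=1)
b = layer_norm(nhwc, gamma, beta, axis=3).transpose(0, 3, 1, 2)
print(np.allclose(a, b))  # → True
```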

btw there seems to be a typo there sacaled -> scaled :)

bernii commented 1 year ago

many places with high errors that I've inspected from this extraction seem to have LayerNormalization in their ancestry tree - wondering if many of those problems might be related to that 🤔

PINTO0309 commented 1 year ago

[Second step] Further separation of the model to identify problem areas. Temporarily replace LayerNormalization with Keras implementation for testing.

sne4onnx \
--input_onnx_file_path latest.opset17.head3.onnx \
--input_op_names /model/backbone/layers.1/downsample/norm/LayerNormalization_output_0 onnx::MatMul_14949 \
--output_op_names /model/backbone/layers.2/blocks.0/Add_output_0 \
--output_onnx_file_path latest.opset17.head3_test.onnx
onnx2tf -i latest.opset17.head3_test.onnx

image

From the inference results, I identified no problems with LayerNormalization.

latest.opset17.head3_test_float32.tflite.zip

However, I have noticed that errors occur when local models are extracted and transformed within the following ranges.

sne4onnx \
--input_onnx_file_path latest.opset17.head3.onnx \
--input_op_names /model/backbone/layers.2/downsample/norm/LayerNormalization_output_0 onnx::MatMul_15970 \
--output_op_names /model/backbone/layers.3/blocks.0/Add_output_0 \
--output_onnx_file_path latest.opset17.head3_test2.onnx
onnx2tf -i latest.opset17.head3_test2.onnx \
-kat /model/backbone/layers.2/downsample/norm/LayerNormalization_output_0

Note that the output of LayerNormalization is consistent in this range as well, indicating that it is the Reshape connected after LayerNormalization that is at issue.
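This is why a stray Transpose in front of Reshape is fatal: Reshape reads elements in memory order, so transposing first changes which element lands where, even when the final shape is the same (a NumPy illustration of the failure mode, not onnx2tf code):

```python
import numpy as np

x = np.arange(6).reshape(1, 2, 3)  # what the ONNX-side Reshape would see

correct = x.reshape(1, 3, 2)                     # Reshape applied directly
broken = x.transpose(0, 2, 1).reshape(1, 3, 2)   # extrapolated Transpose first

print(correct.ravel())  # [0 1 2 3 4 5]
print(broken.ravel())   # [0 3 1 4 2 5] -- same shape, different element order
```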

image

latest.opset17.head3_test2_float32.tflite.zip

image

{
  "format_version": 1,
  "operations": [
    {
      "op_name": "/model/backbone/layers.3/blocks.0/norm1/LayerNormalization",
      "param_target": "outputs",
      "param_name": "/model/backbone/layers.3/blocks.0/norm1/LayerNormalization_output_0",
      "post_process_transpose_perm": [0,2,1]
    }
  ]
}
onnx2tf -i latest.opset17.head3_test2.onnx \
-kat /model/backbone/layers.2/downsample/norm/LayerNormalization_output_0 \
-prf replace_latest.opset17.head3.json

I disabled Transpose, which onnx2tf automatically extrapolates just before Reshape, and the output matched perfectly.

image

image

bernii commented 1 year ago

Thanks for checking and the explanation! I took a little detour and tried playing with the model after ONNX conversion with opset=16, since LayerNormalization was introduced in opset=17, but I haven't had a ton of luck there either - some transpose errors were obvious and easy to fix, but others just seem tricky and no matter what I do it doesn't help much. I guess I'll go back to looking into the original (opset17) model instead.

PINTO0309 commented 1 year ago

Thank you. I have almost identified the problem as being with MatMul or LayerNormalization, so it will take some time, but I will try to determine the real cause.

bernii commented 1 year ago

I just did take a look into a new extraction

sne4onnx --input_onnx_file_path latest.opset17.onnx \
--input_op_names /model/backbone/layers.1/downsample/Reshape_output_0 \
--output_op_names /model/backbone/layers.2/blocks.17/Add_1_output_0 \
--output_onnx_file_path latest.opset17.head2a.onnx 

and I definitely have couple examples of errors around MatMul

1.

INFO: onnx_output_name: /model/backbone/layers.2/blocks.7/mlp/fc2/MatMul_output_0 tf_output_name: tf.linalg.matmul_48/MatMul:0 shape: (1, 4096, 512) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.0002841949462890625

Screenshot 2023-04-18 at 15 35 46 Screenshot 2023-04-18 at 15 35 09

2.

INFO: onnx_output_name: /model/backbone/layers.2/blocks.8/attn/MatMul_output_0 tf_output_name: tf.linalg.matmul_50/MatMul:0 shape: (36, 16, 144, 144) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.00010150671005249023

Screenshot 2023-04-18 at 15 38 07

Screenshot 2023-04-18 at 15 36 59

3.

INFO: onnx_output_name: /model/backbone/layers.2/blocks.8/mlp/fc1/MatMul_output_0 tf_output_name: tf.linalg.matmul_53/MatMul:0 shape: (1, 4096, 2048) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.0002193450927734375

Screenshot 2023-04-18 at 15 39 18 Screenshot 2023-04-18 at 15 39 36

PINTO0309 commented 1 year ago

Thanks to your help, I feel I have almost identified some bug in MatMul. So far I have not been able to determine what kind of implementation error is going on.

PINTO0309 commented 1 year ago

It will take some time, but the following four steps will help to isolate the problem into smaller pieces and proceed to deal with it.

PINTO0309 commented 1 year ago

I first examine how the operations around the MatMul OP, where large errors occur, are affected by TensorFlow's automatic optimization.

INFO:
onnx_output_name: /model/backbone/layers.2/blocks.7/mlp/act/Mul_output_0
tf_output_name: tf.math.multiply_218/Mul:0 shape: (1, 4096, 2048) dtype: float32
validate_result:  Unmatched  max_abs_error: 0.0001201629638671875

image

Netron was used to extract the weights of the TensorFlow-generated MatMul and the ONNX MatMul.

image

The extracted weights were loaded in NumPy and the tensor values were directly compared. This is as expected, although the axis order is reversed from ONNX due to the tflite specification.

>>> import numpy as np
>>> a = np.load('tensor.tflite')
>>> a.shape
(512, 2048)
>>> b = np.load('tensor.onnx')
>>> b.shape
(2048, 512)

The tensor on the tflite side is transposed and then compared to the tensor on the onnx side. Even when the shapes match, the tensor values are different.

>>> a = a.transpose(1, 0)
>>> a
array([[ 0.01230378, -0.03669162, -0.03036745, ..., -0.00676962,
        -0.04225197,  0.00925315],
       [ 0.00777659,  0.02565953,  0.0113832 , ..., -0.00934533,
        -0.01455412, -0.01227233],
       [ 0.02722736,  0.031984  ,  0.00711105, ..., -0.02765727,
        -0.01011038,  0.00861716],
       ...,
       [-0.03417948,  0.00910165, -0.00887218, ..., -0.02092178,
         0.00249663, -0.00233007],
       [ 0.00433086,  0.0363299 ,  0.00444374, ...,  0.03763948,
        -0.0101961 ,  0.04696464],
       [-0.01911171, -0.0354248 , -0.03452693, ..., -0.02617664,
         0.01966653,  0.0483874 ]], dtype=float32)
>>> b
array([[ 0.0338437 , -0.05869688,  0.05814728, ...,  0.05387477,
        -0.01982893, -0.07530722],
       [ 0.04161093, -0.03468418, -0.03691841, ...,  0.03646894,
         0.02180736, -0.01494471],
       [ 0.0119237 , -0.0290256 , -0.01689629, ..., -0.10944752,
         0.05311076,  0.01313969],
       ...,
       [ 0.03222526,  0.01665373,  0.04784529, ..., -0.02973668,
        -0.01845495, -0.06931754],
       [-0.04220989,  0.00557026,  0.04425499, ...,  0.02908991,
         0.00306579,  0.04611704],
       [ 0.04056479,  0.07666754, -0.01405692, ..., -0.01036486,
         0.14133719, -0.04931771]], dtype=float32)

The reason for the weight discrepancy may be that the preceding Mul was precomputed and absorbed into the MatMul weight. Therefore, I multiply the weight on the ONNX side by 0.5 and check whether the value matches the tflite weight.

>>> a
array([[ 0.01230378, -0.03669162, -0.03036745, ..., -0.00676962,
        -0.04225197,  0.00925315],
       [ 0.00777659,  0.02565953,  0.0113832 , ..., -0.00934533,
        -0.01455412, -0.01227233],
       [ 0.02722736,  0.031984  ,  0.00711105, ..., -0.02765727,
        -0.01011038,  0.00861716],
       ...,
       [-0.03417948,  0.00910165, -0.00887218, ..., -0.02092178,
         0.00249663, -0.00233007],
       [ 0.00433086,  0.0363299 ,  0.00444374, ...,  0.03763948,
        -0.0101961 ,  0.04696464],
       [-0.01911171, -0.0354248 , -0.03452693, ..., -0.02617664,
         0.01966653,  0.0483874 ]], dtype=float32)
>>> c = b * 0.5
>>> c
array([[ 0.01692185, -0.02934844,  0.02907364, ...,  0.02693738,
        -0.00991446, -0.03765361],
       [ 0.02080546, -0.01734209, -0.0184592 , ...,  0.01823447,
         0.01090368, -0.00747235],
       [ 0.00596185, -0.0145128 , -0.00844814, ..., -0.05472376,
         0.02655538,  0.00656984],
       ...,
       [ 0.01611263,  0.00832686,  0.02392264, ..., -0.01486834,
        -0.00922747, -0.03465877],
       [-0.02110494,  0.00278513,  0.02212749, ...,  0.01454496,
         0.0015329 ,  0.02305852],
       [ 0.02028239,  0.03833377, -0.00702846, ..., -0.00518243,
         0.07066859, -0.02465885]], dtype=float32)

Despite the manual calculations, the tflite weights do not match the onnx weights at all. Therefore, let us examine the maximum absolute error.

An incomprehensibly large error was displayed. This is a much larger error than the error detected by the -cotof option. Therefore, my hand calculations are incorrect.

>>> d = c - a
>>> np.max(abs(d))
0.75175154

At the moment I have not been able to reproduce the error by hand calculations, so I have not been able to identify the problem area.
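The weight-comparison procedure above (transpose the tflite tensor back to ONNX's layout, optionally fold in the suspected pre-multiplied constant, then take the max absolute difference) condenses to a small helper. A NumPy sketch with synthetic stand-ins for the dumped tensors; the real ones were np.load'ed from Netron exports, and the 0.5 factor is the hypothesis tested above:

```python
import numpy as np

def max_weight_error(tflite_w, onnx_w, fold_scale=1.0):
    """Compare a tflite MatMul weight against its ONNX counterpart.

    tflite stores the weight transposed relative to ONNX, so transpose
    before differencing; `fold_scale` lets you test the hypothesis that
    a preceding constant Mul was folded into the weight."""
    return float(np.max(np.abs(onnx_w.T * fold_scale - tflite_w)))

# Synthetic stand-ins: pretend the converter really did fold a *0.5
# Mul into the transposed weight.
rng = np.random.default_rng(0)
onnx_w = rng.standard_normal((2048, 512)).astype(np.float32)
tflite_w = (onnx_w.T * 0.5).astype(np.float32)

print(max_weight_error(tflite_w, onnx_w))                  # large: raw weights differ
print(max_weight_error(tflite_w, onnx_w, fold_scale=0.5))  # ~0: hypothesis holds
```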

daquexian commented 1 year ago

Hi @PINTO0309 I'm the author of onnxsim. Sorry for the inconvenience. I have fixed the bug in onnxsim in the latest version. You might want to give it a try and upgrade to the latest version :)

bernii commented 1 year ago

@PINTO0309 anything I can do to help? I just tried running analysis again after pulling latest changes you made

$ sne4onnx --input_onnx_file_path latest.opset17.onnx --input_op_names /model/backbone/layers.1/downsample/Reshape_output_0 --output_op_names /model/backbone/layers.2/blocks.17/Add_1_output_0 --output_onnx_file_path latest.opset17.head2a.onnx
$ python onnx2tf/onnx2tf.py -i latest.opset17.head2a.onnx -prf myreplace.opset16.json -cotof

and errors seem similar to the ones before. For example, in my previous comment the error was 0.0002193450927734375 and now it's 0.0002765655517578125

INFO: onnx_output_name: /model/backbone/layers.2/blocks.8/mlp/fc1/MatMul_output_0 tf_output_name: tf.linalg.matmul_53/MatMul:0 shape: (1, 4096, 2048) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.0002765655517578125
PINTO0309 commented 1 year ago

Thank you.

I will be very busy until the second week of May and will not have enough time to debug and maintain at the same pace as before.

Probably related to this issue as well. Since the accuracy check fails with a simple multiplication, there is most likely a problem with the logic of the accuracy check rather than a problem with the model transformation. [DN-DAB-DETR] The output of ONNX's Mul OP is different from the TFLite's output. #327

PINTO0309 commented 1 year ago

It takes about 30 minutes for one conversion test. However, the conversion was ultimately successful. When using the -cotof option, 160 GB of RAM is required. Input resolution 1024x1024 is too large.

https://github.com/PINTO0309/onnx2tf/releases/tag/1.11.0

https://github.com/PINTO0309/onnx2tf/blob/main/json_samples/replace_InSPyReNet.json

image

https://s3.ap-northeast-2.wasabisys.com/temp-models/onnx2tf_312/latest.opset17_float32.tflite https://s3.ap-northeast-2.wasabisys.com/temp-models/onnx2tf_312/latest.opset17_float16.tflite

I have removed the Bug label so the bot will automatically close if no comments are posted for 5 days.

github-actions[bot] commented 1 year ago

If there is no activity within the next two days, this issue will be closed automatically.

bernii commented 1 year ago

did a quick test today and it seems the converted model gives perfect results; I'll do more tests over the week and report back if there's anything fishy :)

thanks for tremendous work! @PINTO0309 🙇 🙇

bernii commented 1 year ago

I can confirm - everything works like a charm 👌 ❤️