ZhangGe6 / onnx-modifier

A tool to modify ONNX models in a visualization fashion, based on Netron and Flask.
MIT License

Modify batch failed #74

Closed Peize-Liu closed 1 year ago

Peize-Liu commented 1 year ago

Dear team,

I want to express my deep appreciation for your outstanding work. I recently made modifications to an ONNX model obtained from the ONNX ZOO hitnet repository, which can be found at this link. The original input shape of the model is [1, 2, 320, 240], but I have made changes to enable batch processing with a shape of [4, 2, 320, 240]. I successfully applied these modifications using the onnx-modifier tool.

However, I have encountered an issue with the output when running the model through the ONNX Runtime API. Although the output shape displayed in onnx-modifier appears to be correct, the actual output still has batch size 1 when the model is run with ONNX Runtime.

onnx_model.zip

Peize-Liu commented 1 year ago

Here is the error info:

[W:onnxruntime:, graph.cc:108 MergeShapeInfo] Error merging shape info for output. 'reference_output_disparity' source:{1,240,320,1} target:{4,240,320,1}. Falling back to lenient merge.

model: /home/khalil/workspace/onnx-modifier/modified_onnx/modified_model_float32.onnx
input name: input, shape: [4, 2, 240, 320]
output name: reference_output_disparity, shape: [1, 240, 320, 1]
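For reference, this is roughly how I check the output shape with the onnxruntime Python API; a minimal sketch, assuming the model path and tensor names from the log above:

```python
import numpy as np
import onnxruntime as ort

# Model path and tensor names taken from the log above.
sess = ort.InferenceSession("modified_model_float32.onnx")

# A dummy batch-4 input matching the edited input shape [4, 2, 240, 320].
x = np.random.rand(4, 2, 240, 320).astype(np.float32)
out = sess.run(["reference_output_disparity"], {"input": x})[0]

# With the problematic model this prints (1, 240, 320, 1) even though
# the graph metadata declares batch size 4.
print(out.shape)
```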

ZhangGe6 commented 1 year ago

@Peize-Liu Thanks for reporting. This issue is reproduced and I am looking into it.

ZhangGe6 commented 1 year ago

@Peize-Liu This warning is raised when the shape value saved in the ONNX metadata and the shape value seen at runtime are inconsistent. In this case, the shape value saved in the ONNX metadata is [4, 2, 240, 320], but the shape at runtime is [1, 240, 320, 1].

In your model, there is a Slice op before the model output. Its starts value is [0, 0, 0, 0] and its ends value is [1, 240, 320, 1]. It seems that the model will only output the inference result of the first batch, regardless of the input batch size. Is this the expected behavior?
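One way to see this is to inspect the Slice nodes with the onnx Python package; a minimal sketch, assuming an opset where starts/ends are initializer inputs (opset >= 10):

```python
import onnx
from onnx import numpy_helper

model = onnx.load("modified_model_float32.onnx")

# Map initializer names to their values, then dump every Slice node's
# starts/ends so the hard-coded batch-1 ends become visible.
inits = {t.name: numpy_helper.to_array(t) for t in model.graph.initializer}
for node in model.graph.node:
    if node.op_type == "Slice":
        # Slice inputs: data, starts, ends, (axes, steps).
        starts, ends = node.input[1], node.input[2]
        print(node.name, "starts:", inits.get(starts), "ends:", inits.get(ends))
```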

Peize-Liu commented 1 year ago

@ZhangGe6 Actually no. The original net is a stereo depth estimation net, which takes two [1, 3, 240, 320] images and outputs a depth image of shape [1, 240, 320, 1]. Therefore, I think after changing the net into batch mode with input [4, 3, 240, 320], the output should be [4, 240, 320, 1].

ZhangGe6 commented 1 year ago

@Peize-Liu Please remember to edit the ends value of the last Slice op from [1, 240, 320, 1] to [4, 240, 320, 1] after changing the batch size to 4. Then the model can run inference without any errors or warnings. I think it is a design issue of the original model.
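If you prefer scripting the edit, here is a minimal sketch with the onnx Python package; the name ends_name is a hypothetical placeholder for whatever name the last Slice op's ends initializer actually has in your model:

```python
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("modified_model_float32.onnx")

ends_name = "<ends-initializer-of-last-Slice>"  # hypothetical placeholder
for i, t in enumerate(model.graph.initializer):
    if t.name == ends_name:
        # Replace ends [1, 240, 320, 1] with [4, 240, 320, 1].
        new = numpy_helper.from_array(
            np.array([4, 240, 320, 1], dtype=np.int64), t.name)
        model.graph.initializer[i].CopyFrom(new)

onnx.save(model, "modified_model_float32_batch4.onnx")
```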

Peize-Liu commented 1 year ago

Thank you a lot! I will give it a try.

Peize-Liu commented 1 year ago

@ZhangGe6
Sorry for bothering you again. I modified CREStereoNet via onnx-modifier from batch size 1 to 4, but the modified model seems to be wrong at the first Concat layer. I have pushed the models to this CREStereo Models link. I'd appreciate it if you could find time to point out where the problem is.

ZhangGe6 commented 1 year ago

> the modified CREStereoNet seems to be wrong at the first Concat layer.

@Peize-Liu Got it. I'll look into it.

BTW, does the "hitnet" with batch size 4 work correctly?

Peize-Liu commented 1 year ago

> BTW, does the "hitnet" with batch size 4 work correctly?

Yes, it works. Thank you very much for your advice!

ZhangGe6 commented 1 year ago

@Peize-Liu Hi, I figured it out. It is a bug in the code and has been fixed. Please update to the latest code and have a try. Thanks for reporting!

[demo GIF: issue74]

This is a brief explanation of the bug: the previous "change batch size" function was implemented by replacing the batch-size metadata of all the nodes with the same value, so it cannot work correctly when a transformation on the batch dim is involved, which is exactly what the first Concat node in CREStereoNet does.

In the latest code, the "change batch size" function is implemented using shape inference rather than the previous hard-coded way, so the issue is expected to be fixed. Feel free to reach out for more discussion if any problem remains.
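The idea behind the shape-inference approach can be illustrated with onnx's built-in pass; this is a much simplified sketch of the concept, not the tool's actual code, and the filenames are hypothetical:

```python
import onnx
from onnx import shape_inference

model = onnx.load("model_batch1.onnx")  # hypothetical filename

# Set only the batch dim of the graph input ...
model.graph.input[0].type.tensor_type.shape.dim[0].dim_value = 4

# ... drop the stale per-node shape metadata, and let ONNX re-derive it by
# propagating shapes through the graph, instead of blindly overwriting every
# node's batch dim with the same value.
del model.graph.value_info[:]
inferred = shape_inference.infer_shapes(model)

onnx.save(inferred, "model_batch4.onnx")
```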

Peize-Liu commented 1 year ago

> It is a bug in the code and has been fixed. Please update to the latest code and have a try.

Thank you very much for your efforts. I feel that there is still an issue with the output dim. Should I fix this locally, or should it be done with onnx-modifier? The expected output dim should be [4, 2, 240, 320], I guess. I have tested the modified model with TensorRT; it can be executed properly, but the output dim is still not as expected.

ZhangGe6 commented 1 year ago

@Peize-Liu Similar to "hitnet", there are also ops that are configured exclusively for batch size 1. For example, after changing the batch size to 4, we need to edit the split value of op init_Split_115 from 1, 1 to 4, 4.

However, to make the ONNX model compatible with batch size 4, there may still be a long way to go, as there are other ops configured for batch size 1 and the model is very complex. In this case, it would be more efficient to export an ONNX model for batch size 4 directly, rather than exporting one for batch size 1 and then editing it. A sketch of the init_Split_115 edit is below.
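A minimal sketch of that edit with the onnx Python package, assuming an opset where Split stores the split sizes as an ints attribute (opset < 13) and using hypothetical filenames:

```python
import onnx

model = onnx.load("modified_crestereo.onnx")  # hypothetical filename

for node in model.graph.node:
    if node.op_type == "Split" and node.name == "init_Split_115":
        for attr in node.attribute:
            if attr.name == "split":
                # Change the per-output split sizes from [1, 1] to [4, 4].
                attr.ints[:] = [4, 4]

onnx.save(model, "modified_crestereo_batch4.onnx")
```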

Peize-Liu commented 1 year ago

> It would be more efficient to export an ONNX model for batch size 4 directly, rather than exporting one for batch size 1 and then editing it.

Exactly. Thank you very much for your work and this project. It really saves time for researchers who are not familiar with the machine learning area. Thank you again for the great job; I have learned a lot from this issue.