alexander-pv / maskrcnn_tf2

Mask R-CNN for object detection and instance segmentation with Keras and TensorFlow V2 and ONNX and TensorRT optimization support.

TRT no results or totally wrong #3

Open malfonsoNeoris opened 3 years ago

malfonsoNeoris commented 3 years ago

Hi again. After successfully training two models, mobilenet_256 and resnet18_256 (where 256 is the image size), I'm now starting the process of validating and converting to ONNX and TRT. I have two problems.

If I continue the process

To clarify: all three tests (TensorFlow, ONNX, and TRT models) were done with the exact same images. The TF2 and ONNX model results are identical.
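One way to make "the same results" measurable is to reduce each pair of backend outputs to a single number. A minimal, framework-agnostic sketch (the helper names and the idea of flattening outputs to plain Python lists are illustrative, not part of the repo):

```python
def flatten(x):
    """Recursively flatten nested lists of numbers into one flat list."""
    if isinstance(x, (int, float)):
        return [float(x)]
    out = []
    for item in x:
        out.extend(flatten(item))
    return out

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two nested outputs."""
    fa, fb = flatten(a), flatten(b)
    assert len(fa) == len(fb), "outputs have different sizes"
    return max(abs(x - y) for x, y in zip(fa, fb))

# Toy 'detections' from two backends; real values would come from the runners.
tf_out  = [[0.91, 10.0, 20.0], [0.40, 5.0, 7.0]]
trt_out = [[0.60, 10.0, 20.0], [0.40, 5.0, 7.5]]
print(max_abs_diff(tf_out, trt_out))  # 0.5 for this toy data
```

A near-zero value for TF2 vs. ONNX and a large value for TF2 vs. TRT pins the divergence to the TRT conversion step rather than the export.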

Attached is a small script I created to test and convert (just copy-pasted from the .ipynb, with some minor mods): inference.zip

Can you give me some direction on where to look for these errors? Thanks again!

alexander-pv commented 3 years ago

Hi, @malfonsoNeoris,

Thank you for the code in the attachment. I'll study it a bit later and help you figure the problem out. The modified .onnx graph is not valid for onnxruntime because its nodes are specially prepared for TensorRT.

malfonsoNeoris commented 3 years ago

Hi Alexander, just an update. For the first issue, I have retrained on the same dataset with mobilenet and resnet18/50 backbones, image size 256. MobileNet works like a charm; both ResNets have the same problem: almost the same result for different images.

Would copying some image results help to understand the problem?

xuatpham commented 3 years ago

Hi @alexander-pv , thanks for your effort.

I've successfully converted a trained tensorflow-model to ONNX and from ONNX to the modified_ONNX.

After that, conversion from modified_ONNX to TRT was successful as well.

But the TRT results seem very different from the original tensorflow_model's.

Is that normal when converting to TRT?

Please advise or suggest how I can improve the TRT result, or where I can dig in and modify the modified_ONNX.

Hello @malfonsoNeoris, how are you doing? Were you able to get good results from TRT?

Once again, thanks all.

alexander-pv commented 3 years ago

Hi, @malfonsoNeoris , @xuatpham

Sorry for the rather late answer.

I have trained several models with the balloon dataset, and I can say that there is an error somewhere in the construction of the ONNX graph for TRT. Sometimes NaNs happen in the TensorRT model output. At the moment, I have found and fixed an error in the data normalization and zero-padding configuration in the ONNX graph. The mAP increased a bit, but I still see periodic NaNs in the output of TRT models. I have started noting repository changes here.
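Spotting those periodic NaNs is easier with a quick per-output scan; a plain-Python sketch (the output names here are illustrative, not the model's real head names):

```python
import math

def count_nans(values):
    """Count NaN entries in a flat iterable of floats."""
    return sum(1 for v in values if isinstance(v, float) and math.isnan(v))

# Illustrative per-output check; real values would come from the TRT engine.
outputs = {
    "detections": [0.9, float("nan"), 12.0],
    "masks": [0.1, 0.2, 0.3],
}
for name, vals in outputs.items():
    n = count_nans(vals)
    if n:
        print(f"{name}: {n}/{len(vals)} NaNs")  # flags 'detections' here
```

Logging which output tensors carry the NaNs, and on which inputs, helps narrow down whether the problem is upstream (e.g. normalization) or in a specific head.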

I plan to compare the subgraph outputs of the tensorflow/onnx model with the tensorrt-optimized version. It is highly likely that this will make it possible to locate the problem in the modified graph.

@xuatpham, you can open ./src/common/inference_optimize.py. It collects all the functions for working with the ONNX graph; the modify_onnx_model function prepares the ONNX model for TensorRT. You can experiment with the graph modification function, or also generate subgraphs, optimize them with TensorRT, and check the differences in the outputs against the original model.

Also, please do not forget to update nvinfer_plugin, since the default mrcnn_config.h header of proposalLayerPlugin may differ from the Python model config.

An interesting fact: for the efficientnet and mobilenet backbones, the mAP drop is quite small.

xuatpham commented 3 years ago


Thank you, Alex, I will have a look at that. Yes, I saw many NaN values when converting to TRT.

But in my experiments, besides the results being quite different from the original, it looks like all the masks were shifted in the same direction, so there is probably a problem with a resize function, I guess.
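If every mask really is shifted by the same amount, that can be checked numerically by averaging the per-detection offset of box coordinates between the reference and TRT outputs. A rough sketch with made-up boxes (matching detections by index is an assumption; in practice you would match them first):

```python
def mean_shift(ref_boxes, trt_boxes):
    """Average (dy, dx) offset of top-left corners, boxes as (y1, x1, y2, x2)."""
    n = len(ref_boxes)
    dy = sum(t[0] - r[0] for r, t in zip(ref_boxes, trt_boxes)) / n
    dx = sum(t[1] - r[1] for r, t in zip(ref_boxes, trt_boxes)) / n
    return dy, dx

tf_boxes  = [(10, 10, 50, 50), (100, 30, 140, 80)]
trt_boxes = [(14, 12, 54, 52), (104, 32, 144, 82)]
print(mean_shift(tf_boxes, trt_boxes))  # (4.0, 2.0) -> a consistent shift
```

A consistent non-zero (dy, dx) across many images would support the resize/padding hypothesis, while random offsets would point elsewhere.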

Anyway, please let us know if you manage to fix the NaN values when converting to TRT. Thanks a lot.

dk-chun commented 2 years ago

Hi @alexander-pv. First of all, many THANKS for your hard work.

I have a question about the TRT results, which look different from the TF and ONNX Runtime ones: 1) the detection scores are different; 2) the masks have comparatively incomplete shapes (they look a bit fuzzy); 3) some detections are missing.

I roughly guess this comes from implementation differences between the TF code and the TRT plugins (ProposalLayer_TRT, PyramidROIAlign_TRT, DetectionLayer_TRT).

Is there a way to get the same results without loss? Please comment. Thank you.
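The "some detections are missing" symptom can be quantified by matching boxes between the two backends by IoU and listing the reference detections with no counterpart. A small sketch (box format and threshold are my own choices for illustration):

```python
def iou(a, b):
    """IoU of two boxes given as (y1, x1, y2, x2)."""
    y1, x1 = max(a[0], b[0]), max(a[1], b[1])
    y2, x2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y2 - y1) * max(0, x2 - x1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def unmatched(ref, pred, thr=0.5):
    """Reference boxes with no prediction above the IoU threshold."""
    return [r for r in ref if all(iou(r, p) < thr for p in pred)]

ref  = [(0, 0, 10, 10), (20, 20, 30, 30)]   # e.g. TF/ONNX detections
pred = [(0, 0, 10, 10)]                     # e.g. TRT detections
print(unmatched(ref, pred))  # [(20, 20, 30, 30)] is missing from TRT
```

Tracking which objects go missing (small ones? near image borders?) often hints at which plugin stage drops them.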

alexander-pv commented 2 years ago

Hi, @dk-chun,

I am glad that you find the repo useful. AFAIK, the TRT plugins were written based on the original matterport model implementation. I believe there are two points that lead to the distorted results in TRT.

First, the ONNX graph modification for TRT porting that happens in the modify_onnx_model function may contain mistakes. I recently found wrong zero-padding node modifications and will push changes to the maskrcnn_tf2.5 develop branch after some tests, ASAP. The first experiments show results closer to the TF & ONNX models.
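For reference when inspecting zero-padding nodes: Keras ZeroPadding2D stores padding as ((top, bottom), (left, right)), while the ONNX Pad operator expects a single pads list with all "begin" values followed by all "end" values over the tensor axes. Mixing up these layouts is an easy way to corrupt the graph. A purely illustrative conversion sketch for an NCHW tensor:

```python
def keras_padding_to_onnx_pads(padding):
    """((top, bottom), (left, right)) -> ONNX pads for an NCHW tensor:
    [n_begin, c_begin, h_begin, w_begin, n_end, c_end, h_end, w_end]."""
    (top, bottom), (left, right) = padding
    return [0, 0, top, left, 0, 0, bottom, right]

print(keras_padding_to_onnx_pads(((1, 1), (1, 1))))  # [0, 0, 1, 1, 0, 0, 1, 1]
```

Checking each Pad node in the modified graph against this layout is a quick sanity test before re-running the TRT conversion.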

Second, nvinfer_plugin should be recompiled to match the customized model config. Otherwise, the TRT plugins may indeed work incorrectly, or segmentation faults can occur.