alexander-pv / maskrcnn_tf2

Mask R-CNN for object detection and instance segmentation with Keras and TensorFlow V2 and ONNX and TensorRT optimization support.

TRT no results or totally wrong #3

Open malfonsoNeoris opened 3 years ago

malfonsoNeoris commented 3 years ago

Hi again. After successfully training two models, mobilenet_256 and resnet18_256 (where 256 is the image size), I'm now starting the process of validating and converting to ONNX and TRT. I have two problems.

If I continue the process

To clarify: all three tests (TensorFlow, ONNX, and TRT models) were done with the exact same images. The TF2 and ONNX model results are identical.
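One way to make "the same results" measurable is to reduce each pair of backend outputs to a single number. A minimal, framework-agnostic sketch (the helper names and the idea of flattening outputs to plain Python lists are illustrative, not part of the repo):

```python
def flatten(x):
    """Recursively flatten nested lists of numbers into one flat list."""
    if isinstance(x, (int, float)):
        return [float(x)]
    out = []
    for item in x:
        out.extend(flatten(item))
    return out

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two nested outputs."""
    fa, fb = flatten(a), flatten(b)
    assert len(fa) == len(fb), "outputs have different sizes"
    return max(abs(x - y) for x, y in zip(fa, fb))

# Toy 'detections' from two backends; real values would come from the runners.
tf_out  = [[0.91, 10.0, 20.0], [0.40, 5.0, 7.0]]
trt_out = [[0.60, 10.0, 20.0], [0.40, 5.0, 7.5]]
print(max_abs_diff(tf_out, trt_out))  # 0.5 for this toy data
```

A near-zero value for TF2 vs. ONNX and a large value for TF2 vs. TRT pins the divergence to the TRT conversion step rather than the export.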

Attached is a small script I created to test and convert (just copy-pasted from the .ipynb, with some minor mods): inference.zip

Can you give me some direction on where to look for these errors? Thanks again!

alexander-pv commented 3 years ago

Hi, @malfonsoNeoris,

Thank you for the code in the attachment. I'll study it a bit later and help you figure the problem out. The modified .onnx graph is not valid for onnxruntime because its nodes are specially prepared for TensorRT.

malfonsoNeoris commented 3 years ago

Hi Alexander, just an update. For the first issue, I have retrained on the same dataset with mobilenet and resnet18/50 backbones, image size 256. MobileNet works like a charm; both ResNets have the same problem: almost the same result for different images.

Would copying some image results help to understand the problem?

xuatpham commented 3 years ago

Hi @alexander-pv , thanks for your effort.

I've successfully converted a trained tensorflow-model to ONNX and from ONNX to the modified_ONNX.

After that, conversion from modified_ONNX to TRT was successful as well.

But the TRT results seem very different from the original tensorflow_model's.

Is that normal when converting to TRT?

Please advise or suggest how I can improve the TRT result, or where I can dig in and modify the modified_ONNX.

Hello @malfonsoNeoris, how are you doing? Were you able to get good results from TRT?

Once again, thanks all.

alexander-pv commented 3 years ago

Hi, @malfonsoNeoris , @xuatpham

Sorry for the rather late answer.

I have trained several models with the balloon dataset, and I can say that there is an error somewhere in the construction of the ONNX graph for TRT. Sometimes NaNs happen in the TensorRT model output. At the moment, I have found and fixed an error in the data normalization and zero-padding configuration in the ONNX graph. The mAP increased a bit, but I still see periodic NaNs in the output of TRT models. I have started noting repository changes here.
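Spotting those periodic NaNs is easier with a quick per-output scan; a plain-Python sketch (the output names here are illustrative, not the model's real head names):

```python
import math

def count_nans(values):
    """Count NaN entries in a flat iterable of floats."""
    return sum(1 for v in values if isinstance(v, float) and math.isnan(v))

# Illustrative per-output check; real values would come from the TRT engine.
outputs = {
    "detections": [0.9, float("nan"), 12.0],
    "masks": [0.1, 0.2, 0.3],
}
for name, vals in outputs.items():
    n = count_nans(vals)
    if n:
        print(f"{name}: {n}/{len(vals)} NaNs")  # flags 'detections' here
```

Logging which output tensors carry the NaNs, and on which inputs, helps narrow down whether the problem is upstream (e.g. normalization) or in a specific head.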

I plan to compare the subgraph outputs of the tensorflow/onnx model with the tensorrt-optimized version. It is highly likely that this will make it possible to locate the problem in the modified graph.

@xuatpham, you can open ./src/common/inference_optimize.py. It collects all the functions for working with the ONNX graph; the modify_onnx_model function prepares the ONNX model for TensorRT. You can experiment with the graph modification function, or also generate subgraphs, optimize them with TensorRT, and check the differences in the outputs against the original model.

Also, please do not forget to update nvinfer_plugin, since the default mrcnn_config.h header of proposalLayerPlugin may differ from the Python model config.

An interesting fact: for the efficientnet and mobilenet backbones, the mAP drop is quite small.

xuatpham commented 3 years ago


Thank you, Alex, I will have a look at that. Yes, I saw many NaN values when converting to TRT.

But in my experiments, besides the results being quite different from the original, it looks like all the masks were shifted in the same direction, so there is probably a problem with a resize function, I guess.
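If every mask really is shifted by the same amount, that can be checked numerically by averaging the per-detection offset of box coordinates between the reference and TRT outputs. A rough sketch with made-up boxes (matching detections by index is an assumption; in practice you would match them first):

```python
def mean_shift(ref_boxes, trt_boxes):
    """Average (dy, dx) offset of top-left corners, boxes as (y1, x1, y2, x2)."""
    n = len(ref_boxes)
    dy = sum(t[0] - r[0] for r, t in zip(ref_boxes, trt_boxes)) / n
    dx = sum(t[1] - r[1] for r, t in zip(ref_boxes, trt_boxes)) / n
    return dy, dx

tf_boxes  = [(10, 10, 50, 50), (100, 30, 140, 80)]
trt_boxes = [(14, 12, 54, 52), (104, 32, 144, 82)]
print(mean_shift(tf_boxes, trt_boxes))  # (4.0, 2.0) -> a consistent shift
```

A consistent non-zero (dy, dx) across many images would support the resize/padding hypothesis, while random offsets would point elsewhere.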

Anyway, please let us know if you manage to fix the NaN values when converting to TRT. Thanks a lot.

dk-chun commented 2 years ago

Hi @alexander-pv. First of all, many THANKS for your hard work.

I have a question about the TRT results, which look different from the TF and ONNX Runtime ones: 1) the detection scores are different; 2) the masks have comparatively incomplete shapes (they look a bit fuzzy); 3) some detections are missing.

I roughly guess this comes from implementation differences between the TF code and the TRT plugins (ProposalLayer_TRT, PyramidROIAlign_TRT, DetectionLayer_TRT).

Is there a way to get the same results without loss? Please comment. Thank you.
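The "some detections are missing" symptom can be quantified by matching boxes between the two backends by IoU and listing the reference detections with no counterpart. A small sketch (box format and threshold are my own choices for illustration):

```python
def iou(a, b):
    """IoU of two boxes given as (y1, x1, y2, x2)."""
    y1, x1 = max(a[0], b[0]), max(a[1], b[1])
    y2, x2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y2 - y1) * max(0, x2 - x1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def unmatched(ref, pred, thr=0.5):
    """Reference boxes with no prediction above the IoU threshold."""
    return [r for r in ref if all(iou(r, p) < thr for p in pred)]

ref  = [(0, 0, 10, 10), (20, 20, 30, 30)]   # e.g. TF/ONNX detections
pred = [(0, 0, 10, 10)]                     # e.g. TRT detections
print(unmatched(ref, pred))  # [(20, 20, 30, 30)] is missing from TRT
```

Tracking which objects go missing (small ones? near image borders?) often hints at which plugin stage drops them.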

alexander-pv commented 2 years ago

Hi, @dk-chun,

I am glad that you find the repo useful. AFAIK, the TRT plugins were written based on the original matterport model implementation. I believe there are two points that lead to the distorted results in TRT.

First, the ONNX graph modification for TRT porting that happens in the modify_onnx_model function may contain mistakes. I recently found wrong zero-padding node modifications and will push changes to the maskrcnn_tf2.5 develop branch after some tests, ASAP. The first experiments show results closer to the TF & ONNX models.
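For reference when inspecting zero-padding nodes: Keras ZeroPadding2D stores padding as ((top, bottom), (left, right)), while the ONNX Pad operator expects a single pads list with all "begin" values followed by all "end" values over the tensor axes. Mixing up these layouts is an easy way to corrupt the graph. A purely illustrative conversion sketch for an NCHW tensor:

```python
def keras_padding_to_onnx_pads(padding):
    """((top, bottom), (left, right)) -> ONNX pads for an NCHW tensor:
    [n_begin, c_begin, h_begin, w_begin, n_end, c_end, h_end, w_end]."""
    (top, bottom), (left, right) = padding
    return [0, 0, top, left, 0, 0, bottom, right]

print(keras_padding_to_onnx_pads(((1, 1), (1, 1))))  # [0, 0, 1, 1, 0, 0, 1, 1]
```

Checking each Pad node in the modified graph against this layout is a quick sanity test before re-running the TRT conversion.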

Second, nvinfer_plugin should be recompiled to match the customized model config. Otherwise, the TRT plugins may indeed work incorrectly, or segmentation faults can occur.