alexander-pv / maskrcnn_tf2

Mask R-CNN for object detection and instance segmentation with Keras and TensorFlow V2 and ONNX and TensorRT optimization support.

Some questions for TRT conversion #1

Closed · malfonsoNeoris closed this issue 3 years ago

malfonsoNeoris commented 3 years ago

Hi, my name is Manuel, and first of all THANKS for this update of the library. I'm still learning about image segmentation on the NVIDIA Xavier, and most of the libraries/models that I tried didn't work (even NVIDIA tutorials); there was always a "but", and something would start failing. I just cloned your repo, changed some configs for my custom COCO dataset, and trained and tested like a charm. I have even been able to convert it to ONNX without problems.

Now I want to take the last step of converting to TRT, and the problems are starting. I'm stuck at modify_onnx_model, as it seems some of the layers have changed or have different names. For example, for ResizeNearestTrt there is 'mobilenet': ['Resize__303:0', 'Resize__313:0', 'Resize__323:0'], but when looking, I can only find 304, 314 and 324. Then, for example, there are no tf_op_layer_AddV2 layers (or maybe they have other names).
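
For reference, this is roughly how I compared the names (a minimal sketch; the model path is just an example):

```python
import onnx

# List the node names/ops in the exported graph to compare them with
# the hardcoded names that modify_onnx_model expects.
model = onnx.load("maskrcnn.onnx")  # example path, not the real one
for node in model.graph.node:
    if "Resize" in (node.name or "") or node.op_type == "Add":
        print(node.op_type, node.name, list(node.output))
```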

Can you please help me with this?

One thing I must tell you is that I didn't follow your requirements file to perfection; I already had most of the packages installed and just tried whether they worked before creating a new env: TensorFlow 2.4.1, Keras 2.4.3, onnxruntime 1.8, onnx 1.9, tf2onnx 1.9, onnx-graphsurgeon 0.3.11.

Is there a possibility that the layer names changed because of this?

And, if you allow me two different questions in one: I would like to try a resnet18/34. Is it possible? How should I modify the config?

Thanks in advance for any response.

alexander-pv commented 3 years ago

Hi, Manuel @malfonsoNeoris :hand:,

I'm glad that you found this repository useful.

1. Why the layer names were different:

To my knowledge, the tensorflow.keras.layers naming convention is sensitive to building multiple models during one session. That's why there is a tensorflow.keras.backend.clear_session() line in the weights transfer from the training graph to the pure inference graph. However, after your question, I did a bit of research and found that tf2onnx is also sensitive to batch size. This was the cause of the Resize__xyz upsampling layers problem, so I fixed it by replacing several hardcoded layer names with a more flexible regexp search. Besides, there is an update in the tensorflow library structure and layer naming from v2.4: for example, tf_op_layer_AddV2/AddV2 becomes tf.__operators__.add/AddV2. I recently tested tensorflow 2.3, 2.4 and 2.5, and now everything should be converted and modified without any problems (with a little trick for 2.5). You can now pull the updates from the master branch.
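
To illustrate the session-wide name counters (a minimal sketch; the exact suffixes may vary between tensorflow versions):

```python
import tensorflow as tf
from tensorflow.keras import backend as K

# Layer names come from session-wide counters, so a second model
# built in the same session gets shifted names like dense_1, dense_2, ...
print(tf.keras.layers.Dense(4).name)  # dense
print(tf.keras.layers.Dense(4).name)  # dense_1

K.clear_session()  # reset the counters before building the inference graph
print(tf.keras.layers.Dense(4).name)  # dense again
```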

2. Know what to change:

There are several apps for inspecting graphs which support many formats. Generally, I use netron for layer-by-layer analysis of NNs. It will definitely help you understand when something is wrong with the graph. I believe that future versions of tensorflow and tf2onnx may only slightly transform the names of the layers, but the operation of a layer should stay constant, such as the addition in tf.__operators__.add/AddV2 and tf_op_layer_AddV2/AddV2. In netron there is a search by node name or part of it, so you can easily find the new node/layer name. But I hope that regexp should be enough :smile: In general, I had no problems with the entire model, except for the summing of convolutional layers and the upsampling layers which you encountered.
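
As a rough illustration of the regexp approach (a sketch with onnx-graphsurgeon; the pattern and path are examples, not the exact code from the repository):

```python
import re

import onnx
import onnx_graphsurgeon as gs

# Match the upsampling Resize nodes by pattern instead of relying on
# hardcoded suffixes, which change between tf2onnx runs.
graph = gs.import_onnx(onnx.load("maskrcnn.onnx"))  # example path
pattern = re.compile(r"Resize__\d+")
for node in graph.nodes:
    if pattern.search(node.name or ""):
        print(node.name, node.op)
```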

Summing before the .onnx graph modification for TRT:

(screenshot: upsampling_tf2onnx_raw)

Summing after the .onnx graph modification for TRT:

(screenshot: upsampling_trt_mod)

3. Managing backbones:

You can inspect the general model config in src/common/config.py. Everything is stored in the CONFIG dictionary. There is a backbone key in it where you can add any backbone stated in README.md. For example, if you want the resnet18 feature extractor, update the dictionary that you pass to the function that builds the model:

```python
CONFIG.update({'backbone': 'resnet18'})
model = mask_rcnn_functional(config=CONFIG)
```
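
In a full script this would look roughly like the following (a sketch; the exact import paths are assumptions based on the repository layout):

```python
import tensorflow as tf

from src.common.config import CONFIG        # assumed import path
from src.model import mask_rcnn_functional  # assumed import path

tf.keras.backend.clear_session()            # avoid layer-name collisions
CONFIG.update({'backbone': 'resnet18'})
model = mask_rcnn_functional(config=CONFIG)
model.summary()
```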

malfonsoNeoris commented 3 years ago

MARVELLOUS! I will try it ASAP!

malfonsoNeoris commented 3 years ago

Hi Alexander, just to tell you that the conversion to TRT worked. My model: backbone mobilenet, input_size 512x512.

After conversion to an fp16 .engine I got ~8 FPS for inference and ~2.5 GB of RAM used, tested on a Xavier NX. Are these values expected, or should I get higher FPS or lower memory consumption?

Again, thanks for your amazing work.

alexander-pv commented 3 years ago

Hi, Manuel, I'm glad to hear that. Thank you for the information! Unfortunately, I don't have a Xavier NX for tests, but in comparison to the Xavier AGX the result looks plausible.

I am planning to add the NCHW input order option later, which should increase the inference speed a bit.