NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

samples/python/detectron2 image resolution #2142

Closed · habjoel closed this 2 years ago

habjoel commented 2 years ago

Hi there,

I have been following the tutorial on samples/python/detectron2 to deploy a Mask-RCNN model to TensorRT. However, the following sentence makes me wonder: "Detectron 2 Mask R-CNN R50-FPN 3x model is dynamic with minimum testing dimension size of 800 and maximum of 1333."

I keep reading these image sizes in relation to Mask-RCNN models. For example, this page from ONNX on Mask-RCNN states something similar:

This model can take images of different sizes as input. However, to achieve best performance, it is recommended to resize the image such that both height and width are within the range of [800, 1333], and then pad the image with zeros such that both height and width are divisible by 32.

Also, MMDetection, which offers conversion scripts to TensorRT through MMDeploy, has its standard export config set to a range of [800, 1344], as can be seen here.

Clearly, this cannot be a coincidence. So my first question is: Why is everything exported/optimized to that image resolution range? Does it have something to do with the images from the COCO Dataset? Furthermore, I would really like to use a higher resolution than [1344, 1344]. Is that even possible with Mask-RCNN networks?

Thanks a lot for your help!

zerollzeng commented 2 years ago

@kevinch-nv Can you help triage it? thanks!

azhurkevich commented 2 years ago

@habjoel I can tell you the answer but it will be a long one.

Let me explain why exactly 1,344x1,344. If we look at the config file of the Detectron 2 Mask R-CNN R50-FPN 3x model, we will find the variables MIN_SIZE_TEST and MAX_SIZE_TEST, with values of 800 and 1,333 respectively. This Mask R-CNN first tries to upscale the shortest dimension to 800, then upscales the longer dimension to whatever value keeps the same aspect ratio. However, there is a maximum of 1,333: if the longer dimension hits that limit before the shorter dimension reaches 800, the longer dimension is kept at 1,333 and the shorter dimension is scaled to whatever preserves the aspect ratio. For example (COCO): 000000000285.jpg's true dimensions (640, 586) → upscaled (874, 800); 000000000724.jpg's true dimensions (500, 375) → upscaled (1067, 800); 000000001490.jpg's true dimensions (315, 640) → upscaled (656, 1333). This logic is implemented in Detectron 2's ResizeShortestEdge augmentation class.

On top of that, every single dimension must be divisible by 32, a requirement that comes from the resulting feature maps. As a result, in cases where one of the dimensions is 1,333, it is upscaled further to 1,344 (1344/32 = 42), and the shorter dimension is also upscaled to maintain the aspect ratio and the divisibility requirement. Hence, in order to cover every single case, it is easiest to just replicate the Detectron 2 ResizeShortestEdge behavior and pad the rest of the image up to 1,344x1,344. This was a great resource I used to learn a lot about this architecture; if you want to know more, you can read the original Mask R-CNN paper.
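To make the arithmetic above concrete, here is a minimal Python sketch of the resize-and-round logic. This is my own illustration, not the sample's actual code, and Detectron 2's ResizeShortestEdge may round slightly differently:

import math

def resize_shortest_edge(dim_a, dim_b, min_size=800, max_size=1333, divisor=32):
    # Scale so the shorter side becomes min_size, but cap the longer side at max_size.
    scale = min_size / min(dim_a, dim_b)
    if max(dim_a, dim_b) * scale > max_size:
        scale = max_size / max(dim_a, dim_b)
    resized = (round(dim_a * scale), round(dim_b * scale))
    # Each dimension must then end up divisible by 32 (feature-map requirement).
    divisible = tuple(math.ceil(d / divisor) * divisor for d in resized)
    return resized, divisible

# The COCO examples above:
print(resize_shortest_edge(640, 586))  # ((874, 800), (896, 800))
print(resize_shortest_edge(500, 375))  # ((1067, 800), (1088, 800))
print(resize_shortest_edge(315, 640))  # ((656, 1333), (672, 1344))
# The largest value any dimension can reach is 1333 -> 1344, which is why the
# static engine input is padded all the way up to 1344x1344.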

Also, keep in mind a very important fact: the preprocessor is part of the NN graph. As a result, if you pad the input with 0s up to 1,344x1,344, the preprocessor's Sub and Div nodes will not leave those 0s alone; they will be altered. What we need is a zero-padded image arriving at the first Conv. Hence we pad with values that will be reversed and become 0s as a result of the preprocessor's subtraction and division.
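As a rough numpy illustration of that padding trick (not the sample's image_batcher.py code), assuming the in-graph preprocessor computes (x - pixel_mean) / pixel_std; the BGR mean/std values below are the R50-FPN config defaults, and the actual values should come from your own model's config:

import numpy as np

# Assumed preprocessor constants (Detectron 2 R50-FPN defaults, BGR order).
pixel_mean = np.array([103.53, 116.28, 123.675], dtype=np.float32)
pixel_std = np.array([1.0, 1.0, 1.0], dtype=np.float32)

def pad_for_preprocessor(img_hwc, target_h=1344, target_w=1344):
    # Pad with the per-channel mean so that the in-graph Sub/Div nodes turn the
    # padded region into exact zeros before it reaches the first Conv.
    h, w, _ = img_hwc.shape
    padded = np.tile(pixel_mean, (target_h, target_w, 1))
    padded[:h, :w, :] = img_hwc
    return padded

# Sanity check: after the preprocessor, the padded border is all zeros.
img = (np.random.rand(800, 1067, 3) * 255).astype(np.float32)
out = (pad_for_preprocessor(img) - pixel_mean) / pixel_std
assert np.allclose(out[800:, :, :], 0) and np.allclose(out[:, 1067:, :], 0)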

To answer your questions more specifically: "Why is everything exported/optimized to that image resolution range?" - I think the best answer is that the people who architected the Detectron 2 Mask R-CNN R50-FPN 3x model decided it is the best resolution range for their use case (the COCO dataset). However, it's better to ask the Detectron2 people for the clearest answer.

"Does it have something to do with the images from the COCO Dataset?" - I think yes. If you have a different dataset, using different resolution might make more sense. However, it's heavily dataset related.

"Furthermore, I would really like to use a higher resolution than [1344, 1344]. Is that even possible with Mask-RCNN networks?" - Yes, sure. If your use case is different from default COCO dataset trained model, you can use a different resolution if you need to. Just change 1,344 to:

import detectron2.data.transforms as T  # detectron2's augmentation transforms

# Replaces the sample's default of ([1344, 1344], 1344); my_res_int is your new resolution.
aug = T.ResizeShortestEdge(
    [my_res_int, my_res_int], my_res_int
)

and you will be good to go. Also, in order to export to ONNX using detectron2, you have to make sure that the model is able to detect something in the image; otherwise there is a specific assertion that will not allow you to export the model.

habjoel commented 2 years ago

@azhurkevich wow, thanks a lot for this long but very informative answer, I appreciate it!

Just to be 100% sure…

Can I take a Mask-RCNN model that has been pretrained on COCO, do transfer learning with that exact model, and simply change the input size to match the dataset I would like to use for transfer learning (so that the training images are not all resized to the range [800, 1344])? Or would that somehow change the weights required for the model?

What I actually want to ask is: can I expect it to work if I simply take a Mask-RCNN model pretrained on COCO, change the input size as you described above, do further transfer learning with higher-resolution images, and then finally also run inference (on TensorRT) with higher-resolution images?

Or will I not be able to use the pretrained COCO weights if I change the input size, meaning that I would need to change the input size of the model first and then train from scratch on COCO (obviously resized to match the new input size) in order to make use of that dataset?

I would obviously prefer the first method, as it would be quite cumbersome to train the model from scratch without pretrained weights.

Thanks a lot for your help!

azhurkevich commented 2 years ago

@habjoel I think here we are getting into more general ML questions rather than TRT stuff. However, I'll give you my opinion. Keep in mind, I am an engineer who wrote this sample and I know plenty about CV, but I am not the best qualified to solve your specific task since I don't know the details. In the end, your judgment is the best tool you have.

Question 1. The weights will not change unless you retrain. You can take multiple TL approaches: you can keep the model the way it is and solve a different problem; you can retrain a bit, or as much as you want, on a new dataset; or you can add a couple of layers on top of the existing model and train only those on your specific dataset, while keeping the rest of the model unchanged. The last approach will probably break the converter because you are customizing the model; however, it will be very easy to adjust the converter to make it work. I left a ton of comments in the code to help people figure out what's going on. In the end it's up to you.
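If you go the "retrain with the COCO weights at a different resolution" route, a rough Detectron 2 fine-tuning sketch looks something like the following. This is not part of the TRT sample; the dataset name, class count, and resolution values are placeholders you would replace with your own:

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
# Start from the COCO-pretrained weights instead of training from scratch.
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")

# Raise the resize range used during training and testing (placeholder values).
cfg.INPUT.MIN_SIZE_TRAIN = (1600,)
cfg.INPUT.MAX_SIZE_TRAIN = 2048
cfg.INPUT.MIN_SIZE_TEST = 1600
cfg.INPUT.MAX_SIZE_TEST = 2048

# "my_dataset_train" is a placeholder for a dataset you have registered yourself.
cfg.DATASETS.TRAIN = ("my_dataset_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3  # set to your own number of classes

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()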

Question 2. I don't know, nobody knows except you. I don't know your dataset, plus nobody knows how it's gonna play out in the end. You just have to try and see. Theoretically it should work, unless it is something completely unreasonable.

Keeping the weights from a model trained with different dimensions might still work. Typically it does. However, it all comes down to your specific use case (Please do not ask me about your specific use case because I still won't be able to provide you an answer). You'll just have to trial and error and see where it leads you. Good luck researching!

Please let us know if we can help you (best if it is TRT related).

habjoel commented 2 years ago

@azhurkevich alright, I'll leave it at that, thanks a lot for your help! I know that you might not be the best qualified to discuss this, but it has been hard to find a good answer to this issue, so I appreciate your effort a lot! :)

letmejoin commented 2 years ago

@azhurkevich

Hello, thanks for your sample. I have one question about my application. In my case, I use the faster-rcnn_x101_fpn model of detectron2. In my dataset, the resolution of the images is not 1:1, and when I finetune the pre-trained d2 model, I use a resolution like 1344x2048. But in your sample, the width and height of the image are 1344x1344, so I tried changing to aug = T.ResizeShortestEdge([1344, 1344], 2048) and changing create_onnx.py to the faster-rcnn model. After that, the TRT model has low inference accuracy. Are my changes reasonable? Or do you have some advice? Must I use 1344x1344 to re-train the d2 model? Thanks!!!

azhurkevich commented 2 years ago

@letmejoin I am OOTO for now, but will try to answer. There are no strict requirements for the resolution; it's just that with the publicly available vanilla model you will get bad mAP otherwise. 1344x2048 is fine if you have trained with it. Make sure that after conversion you definitely have a 1344x2048 input by visualizing the ONNX with Netron. It is hard to say why your model is not performing up to your standards, since it is a custom model and I cannot guarantee that the converter will work with it out of the box, mostly because I cannot predict the changes made to the graph. The converter is very specific about graph nodes; if something changes, it will not perform as expected. Most likely the modifications you've made are incorrect, or nodes in the graph changed so the converter didn't grab the expected nodes, or something is up with your input and the way you preprocess the images (maybe you need a different preprocess than the one I have in image_batcher.py).

I know people have successfully converted Faster R-CNNs by modifying my converter, so it definitely can be done. Your best bet is visualizing the graphs for the vanilla model and your model, before and after the conversion, with Netron and checking whether something doesn't make sense. Look at the vanilla graph before and after, then look at yours. Check the nodes. Please read my code and comments, they will help tremendously. Also, I recommend this material.
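Besides Netron, you can also dump the exported input shape programmatically; a small sketch using the onnx package (the file path is a placeholder):

import onnx

model = onnx.load("converted_model.onnx")  # placeholder path
for inp in model.graph.input:
    dims = [d.dim_value if d.dim_value > 0 else d.dim_param
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)  # expect the spatial dims to read 1344 and 2048 after your change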

letmejoin commented 2 years ago

@azhurkevich Many thanks, and sorry for the late reply. I will study your advice. PS: a small question about create_onnx.py's get_anchors(), which calls imagelist_images = ImageList.from_tensors(images, 1344); your sample chooses 1344. Is this related to the input resolution of 1344x1344? If the input changes to 1344x2048, is 1344 still correct for this function, or is the parameter tied to the input resolution? In my practice, changing 1344 to 32 gives better mAP.
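Not an official answer, but for context on that parameter: in Detectron 2, the second argument of ImageList.from_tensors is the size_divisibility used when padding the batched tensor, which would explain the behavior you observe. A small sketch of how the two values pad a 1344x2048 input:

import torch
from detectron2.structures import ImageList

images = [torch.zeros(3, 1344, 2048)]

# With size_divisibility=32 the batched tensor stays 1344x2048 (already multiples of 32);
# with size_divisibility=1344 the width is rounded up to the next multiple of 1344 (2688),
# which no longer matches a 1344x2048 engine input.
print(ImageList.from_tensors(images, 32).tensor.shape)    # torch.Size([1, 3, 1344, 2048])
print(ImageList.from_tensors(images, 1344).tensor.shape)  # torch.Size([1, 3, 1344, 2688])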