THU-MIG / RepViT

RepViT: Revisiting Mobile CNN From ViT Perspective [CVPR 2024] and RepViT-SAM: Towards Real-Time Segmenting Anything
https://arxiv.org/abs/2307.09283
Apache License 2.0
799 stars, 60 forks

About the out_indices in detection task config #40

Closed: AlphaPlusTT closed this issue 8 months ago

AlphaPlusTT commented 8 months ago

https://github.com/THU-MIG/RepViT/blob/338506bc71d7b4e008cb6f1d94a559e8d8f969c6/detection/configs/mask_rcnn_repvit_m1_1_fpn_1x_coco.py#L15

The out_indices here are [2, 6, 20, 24]. In addition to the layers defined by self.cfgs, there is also a patch_embed layer placed at the start, so each index is offset by one relative to self.cfgs.

https://github.com/THU-MIG/RepViT/blob/338506bc71d7b4e008cb6f1d94a559e8d8f969c6/model/repvit.py#L228-L236

Therefore, out_indices 2 corresponds to line 357 in the code linked below, and out_indices 6 corresponds to line 361. This differs from the common setting, where we typically save the feature map of the last layer before downsampling. Why is it set up this way?

https://github.com/THU-MIG/RepViT/blob/338506bc71d7b4e008cb6f1d94a559e8d8f969c6/model/repvit.py#L354-L380
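
To make the index offset concrete, here is a minimal sketch of how an out_indices value maps back to a row of self.cfgs when patch_embed sits at position 0. The `toy_cfgs` list and both helper functions are hypothetical and only illustrate the bookkeeping; they are not the real repvit_m1_1 configuration or code from the repository.

```python
# Hypothetical toy config, NOT the real repvit_m1_1 cfgs.
# Each row is (channels, stride).
toy_cfgs = [
    (64, 1),
    (64, 1),
    (64, 1),
    (128, 2),
    (128, 1),
]


def build_layer_table(cfgs):
    """List the layers in forward order: patch_embed first, then one entry per cfg row."""
    layers = ["patch_embed"]
    for row, (channels, stride) in enumerate(cfgs):
        layers.append(f"cfgs[{row}] (c={channels}, s={stride})")
    return layers


def cfg_row_for_out_index(out_index):
    """Map a feature index to its cfg row; index 0 is patch_embed and has no row."""
    return None if out_index == 0 else out_index - 1


layers = build_layer_table(toy_cfgs)
for i in (0, 2, 4):
    print(i, "->", layers[i])
```

Under this layout, feature index `i` (for `i >= 1`) always refers to `cfgs[i - 1]`, which is why an out_indices value points one row earlier in self.cfgs than it might appear to.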

jameslahm commented 8 months ago

Thanks for your interest. Because our deeper downsampling layers (Section 3.3 in the paper) consist of two layers, e.g., lines 358 and 359, we use the feature map of the layer with out_indices 2. Using the feature maps of line 358 and line 362 instead should give similar results.
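
As a rough illustration of the two conventions discussed here, the sketch below computes which feature index would be tapped before each downsampling step, depending on whether the downsampling unit is treated as one layer (the common setting) or two layers (as described above). The `toy_strides` list and the `tap_indices` helper are hypothetical; the strides were chosen only so that the two-layer case yields 2 and 6, and they are not the real repvit_m1_1 cfgs.

```python
# Hypothetical per-row strides for a toy cfgs list, NOT the real repvit_m1_1 cfgs.
toy_strides = [1, 1, 1, 2, 1, 1, 1, 2, 1, 1]


def tap_indices(strides, downsample_depth):
    """Return the feature indices tapped just before each downsampling unit.

    downsample_depth is the number of layers that make up one downsampling
    unit, counted backwards from (and including) the stride-2 layer.
    Feature indices are cfg rows shifted by one, since patch_embed is index 0.
    """
    taps = []
    for row, stride in enumerate(strides):
        if stride == 2:
            feature_index = row + 1  # +1 for patch_embed at index 0
            taps.append(feature_index - downsample_depth)
    return taps


print(tap_indices(toy_strides, downsample_depth=1))  # common setting -> [3, 7]
print(tap_indices(toy_strides, downsample_depth=2))  # two-layer unit -> [2, 6]
```

With a one-layer downsampling unit the taps land on the layer immediately before the stride-2 layer; with a two-layer unit they land one layer earlier, which matches the choice of 2 and 6 rather than 3 and 7 in this toy setup.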