AlphaPlusTT closed 8 months ago
Thanks for your interest. Because each of our deeper downsampling layers (Section 3.3 in the paper) consists of two layers, e.g., line 358 and line 359, we use the feature map of the layer at out_indices 2. Using the feature maps of line 358 and line 362 instead should give similar results.
https://github.com/THU-MIG/RepViT/blob/338506bc71d7b4e008cb6f1d94a559e8d8f969c6/detection/configs/mask_rcnn_repvit_m1_1_fpn_1x_coco.py#L15

The out_indices here are [2, 6, 20, 24]. In addition to the layers defined by self.cfgs, a patch_embed layer is placed at the starting position:

https://github.com/THU-MIG/RepViT/blob/338506bc71d7b4e008cb6f1d94a559e8d8f969c6/model/repvit.py#L228-L236

Therefore, the layer with out_indices 2 corresponds to line 357 in the code below, and the layer with out_indices 6 corresponds to line 361. This differs from the common setting, in which the saved feature map is typically that of the layer just before downsampling. Why did you set it up like this?

https://github.com/THU-MIG/RepViT/blob/338506bc71d7b4e008cb6f1d94a559e8d8f969c6/model/repvit.py#L354-L380
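
For readers following the index arithmetic, here is a minimal plain-Python sketch of the off-by-one described above. It is not the RepViT source. The helper names (build_layer_names, cfg_row_for, collect_outputs) are hypothetical, and the count of 24 cfg rows is an assumption, chosen only so that the largest index, 24, is valid. The sketch models just the point that a stem at position 0 shifts every feature index by one relative to the rows of cfgs.

```python
# Illustrative only: a plain-Python model of the indexing, not the RepViT code.

def build_layer_names(num_cfg_rows):
    """Stem at position 0, followed by one block per row of cfgs."""
    return ["patch_embed"] + [f"block(cfgs[{r}])" for r in range(num_cfg_rows)]


def cfg_row_for(out_index):
    """0-based row of cfgs behind a feature index; None means the stem."""
    return None if out_index == 0 else out_index - 1


def collect_outputs(layer_names, out_indices):
    """Mimic a backbone forward pass that keeps only the selected layers."""
    wanted = set(out_indices)
    return [name for i, name in enumerate(layer_names) if i in wanted]


out_indices = [2, 6, 20, 24]         # from the detection config linked above
layer_names = build_layer_names(24)  # assumption: 24 cfg rows
rows = [cfg_row_for(i) for i in out_indices]
selected = collect_outputs(layer_names, out_indices)

print(rows)         # [1, 5, 19, 23]
print(selected[0])  # block(cfgs[1])
```

Under these assumptions, feature index 2 selects the second row of cfgs rather than the third. This is the shift the question relies on when mapping out_indices to source lines.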