facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

[feature request] HTC roi head / configs and HTC++ improvements #4379

Open vadimkantorov opened 2 years ago

vadimkantorov commented 2 years ago

Hi @lyttonhao @HannaMao!

The papers "Exploring Plain Vision Transformer Backbones for Object Detection" and "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection" mention experiments with the HTC/HTC++ detector architectures in connection with https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet and https://github.com/facebookresearch/mvit. Are HTC/HTC++ implemented anywhere in detectron2?

Thanks!

lyttonhao commented 2 years ago

Hi, we don't use HTC/HTC++ detectors in the MViTv2 or ViTDet papers.

vadimkantorov commented 2 years ago

You are right, sorry! I mixed up the comparisons. I'll repurpose this issue as a feature request, then!

HTC implementation in mmdet: https://github.com/open-mmlab/mmdetection/blob/master/mmdet/models/roi_heads/htc_roi_head.py, https://github.com/open-mmlab/mmdetection/tree/master/configs/htc
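For anyone who wants to prototype this, detectron2's registry makes the entry point clear. Below is a minimal, hypothetical skeleton (`HTCROIHeads` is a made-up name): it registers a subclass of the existing `CascadeROIHeads`, which is where HTC's interleaved mask branches and semantic segmentation branch would have to be added. None of that HTC-specific logic is implemented here, so this is a starting point, not a port.

```python
# Sketch only: HTCROIHeads is a hypothetical name. The HTC mask information
# flow and semantic branch from the paper are NOT implemented below.
from detectron2.modeling.roi_heads import ROI_HEADS_REGISTRY, CascadeROIHeads


@ROI_HEADS_REGISTRY.register()
class HTCROIHeads(CascadeROIHeads):
    """Skeleton for an HTC-style head: cascade box stages plus, per the
    HTC paper, per-stage mask branches whose features feed the next stage."""

    def _forward_mask(self, features, instances):
        # HTC would run one mask head per cascade stage and pass each
        # stage's mask features to the next (the "information flow"),
        # optionally fused with a semantic segmentation branch.
        # Falling back to the parent behavior keeps this sketch runnable.
        return super()._forward_mask(features, instances)
```

With that registered, `cfg.MODEL.ROI_HEADS.NAME = "HTCROIHeads"` would select it, and the remaining work is porting the stage-wise mask logic from the mmdet files linked above.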

On HTC++: Swin paper (https://github.com/microsoft/Swin-Transformer/issues/113#issuecomment-993127866):

For system-level comparison, we adopt an improved HTC [9] (denoted as HTC++) with instaboost [22], stronger multi-scale training [7] (resizing the input such that the shorter side is between 400 and 1400 while the longer side is at most 1600), a 6x schedule (72 epochs with the learning rate decayed at epochs 63 and 69 by a factor of 0.1), soft-NMS [5], an extra global self-attention layer appended at the output of the last stage, and an ImageNet-22K pre-trained model as initialization. We adopt stochastic depth with ratio 0.2 for all Swin Transformer models.
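As a side note, the "stronger multi-scale training" part of that recipe already maps onto detectron2's stock input pipeline; here is a sketch using only existing config keys and transforms (the HTC++ head itself remains the missing piece):

```python
# Sketch: Swin-style multi-scale training in detectron2's config system.
# Shorter side sampled uniformly in [400, 1400], longer side capped at 1600.
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.INPUT.MIN_SIZE_TRAIN = (400, 1400)
cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING = "range"  # sample from the interval
cfg.INPUT.MAX_SIZE_TRAIN = 1600

# Equivalently, as an explicit augmentation:
from detectron2.data import transforms as T

aug = T.ResizeShortestEdge(
    short_edge_length=(400, 1400), max_size=1600, sample_style="range"
)
```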

SwinV2 paper:

COCO object detection: We adopt HTC++ [10, 46] for experiments. In data pre-processing, Instaboost [23], multi-scale training [26] with an input image size of 1536×1536, a window size of 32×32, and a random scale between [0.1, 2.0] are used. An AdamW optimizer [48] with an initial learning rate of 4 × 10⁻⁴ on a batch size of 64, a weight decay of 0.05, and a 3× scheduler are used. The backbone learning rate is set to 0.1× of the head learning rate. In inference, soft-NMS [5] is used. Both single-scale and multi-scale test results are reported.
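The optimizer side of this recipe is easy to reproduce independently of the missing head. A sketch in plain PyTorch, assuming a detectron2-style model whose backbone parameters are named with a `backbone.` prefix (`build_htc_optimizer` is a hypothetical helper, not an existing API):

```python
# Sketch of the SwinV2 optimizer recipe: AdamW, lr 4e-4, weight decay 0.05,
# with the backbone learning rate scaled to 0.1x of the head learning rate.
# Assumes a detectron2-style model whose backbone params are named "backbone.*".
import torch


def build_htc_optimizer(model, base_lr=4e-4, weight_decay=0.05, backbone_mult=0.1):
    backbone_params, head_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (backbone_params if name.startswith("backbone.") else head_params).append(p)
    return torch.optim.AdamW(
        [
            {"params": backbone_params, "lr": base_lr * backbone_mult},
            {"params": head_params, "lr": base_lr},
        ],
        lr=base_lr,
        weight_decay=weight_decay,
    )
```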