In this paper, we have presented Mask DINO as a unified Transformer-based framework for both object detection and image segmentation. Mask DINO shows that detection and segmentation can help each other in query-based models.
DETR adopts a set-prediction objective and eliminates hand-crafted modules such as anchor design and non-maximum suppression.
Although DETR addresses both the object detection and panoptic segmentation tasks, its segmentation performance is still inferior to classical segmentation models.
In Transformer-based models, the best-performing detection and segmentation models are still not unified, which prevents task and data cooperation between detection and segmentation tasks.
It naturally leads to two questions: 1) why cannot detection and segmentation tasks help each other in Transformer-based models? and 2) is it possible to develop a unified architecture to replace specialized ones?
First, we propose a unified and enhanced query selection. Second, we propose a unified denoising training for masks to accelerate segmentation training. Third, we use a hybrid bipartite matching for more accurate and consistent matching from ground truth to both boxes and masks.
To summarize, our contributions are three-fold. 1) We develop a unified Transformer-based framework for both object detection and segmentation. 2) We demonstrate that detection and segmentation can help each other through a shared architecture design and training method. 3) We also show that, via a unified framework, segmentation can benefit from detection pre-training on a large-scale detection dataset.
Mask DINO adds another branch for mask prediction and minimally extends several key components in detection to fit segmentation tasks.
Why cannot Mask2Former do detection well? First, its queries follow the design in DETR [1] without being able to utilize better positional priors as studied in Conditional DETR [26], Anchor DETR [34], and DAB-DETR [22].
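For reference, a minimal sketch of the kind of positional prior those works use: each query carries an explicit anchor box that is turned into a sinusoidal positional embedding, rather than a content-free learned embedding as in DETR. Function name and dimensions here are illustrative, not the papers' exact code.

```python
import math
import torch

def box_to_pos_embed(anchors: torch.Tensor, dim: int = 256, temperature: float = 10000.0) -> torch.Tensor:
    """Turn normalized anchor boxes (num_queries, 4) into sinusoidal positional queries.

    Each of the 4 box coordinates gets dim // 4 sine/cosine channels, so the
    positional query carries an explicit spatial prior.
    """
    num_feats = dim // 4
    dim_t = torch.arange(num_feats, dtype=torch.float32, device=anchors.device)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_feats)

    pos = anchors * 2 * math.pi                     # (num_queries, 4)
    pos = pos[..., None] / dim_t                    # (num_queries, 4, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(-3)                          # (num_queries, dim)
```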
Second, Mask2Former adopts masked attention (multi-head attention with an attention mask) in its Transformer decoders. The attention masks predicted from a previous layer are of high resolution and used as hard constraints for attention computation.
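A rough single-head sketch of this kind of masked attention (illustrative, not Mask2Former's actual code): the mask predicted by the previous layer is thresholded and used to block attention outside the foreground region.

```python
import torch

def masked_cross_attention(q, k, v, prev_mask_logits, threshold: float = 0.5):
    """q: (num_queries, dim) decoder queries; k, v: (num_pixels, dim) image features;
    prev_mask_logits: (num_queries, num_pixels) mask logits from the previous layer."""
    logits = q @ k.t() / q.shape[-1] ** 0.5                # (num_queries, num_pixels)
    keep = prev_mask_logits.sigmoid() > threshold          # hard constraint from the previous prediction
    keep = keep | ~keep.any(dim=-1, keepdim=True)          # fall back to full attention if a mask is empty
    attn = logits.masked_fill(~keep, float("-inf")).softmax(dim=-1)
    return attn @ v                                        # (num_queries, dim)
```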
Third, Mask2Former cannot explicitly perform box refinement layer by layer.
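By contrast, DETR-like detectors with explicit box queries refine reference boxes at every decoder layer. A minimal sketch of that iterative refinement, with hypothetical helper names:

```python
import torch

def inverse_sigmoid(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

def refine_boxes(reference_boxes: torch.Tensor, query_embed: torch.Tensor, box_head) -> torch.Tensor:
    """One decoder layer of box refinement.

    reference_boxes: (num_queries, 4) normalized (cx, cy, w, h) from the previous layer;
    query_embed: (num_queries, dim) updated query features; box_head predicts a 4-dim delta.
    """
    delta = box_head(query_embed)                                    # predicted offset in logit space
    refined = (inverse_sigmoid(reference_boxes) + delta).sigmoid()   # apply offset, map back to [0, 1]
    return refined.detach()  # detached boxes serve as the next layer's positional prior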
Segmentation branch To perform segmentation, Mask DINO obtains a mask by dot-producting each content query embedding with a high-resolution pixel embedding map fused from backbone and Transformer encoder features.
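A minimal sketch of this dot-product mask head (tensor names and shapes are illustrative):

```python
import torch

def predict_masks(query_embed: torch.Tensor, pixel_embed: torch.Tensor) -> torch.Tensor:
    """query_embed: (num_queries, dim) content query embeddings from the decoder;
    pixel_embed: (dim, H, W) per-pixel embedding map fused from backbone and encoder features.
    Returns mask logits of shape (num_queries, H, W)."""
    return torch.einsum("qc,chw->qhw", query_embed, pixel_embed)
```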
Unified query selection for mask Query selection has been widely used in traditional two-stage models [28] and many DETR-like models [37, 40] to improve detection performance.
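In DETR-like query selection, top-ranked encoder features initialize the decoder queries; the unified version additionally predicts an initial mask from the selected features. A rough sketch under assumed tensor shapes and head names:

```python
import torch

def unified_query_selection(enc_tokens, cls_head, box_head, pixel_embed, num_queries: int = 300):
    """Select top-scoring encoder tokens to initialize content queries, anchor boxes,
    and initial mask predictions (heads and shapes are illustrative).

    enc_tokens: (num_tokens, dim) Transformer encoder output features;
    pixel_embed: (dim, H, W) per-pixel embedding map.
    """
    scores = cls_head(enc_tokens).max(dim=-1).values                        # (num_tokens,) best class score
    topk = scores.topk(num_queries).indices
    content_queries = enc_tokens[topk]                                      # initialize decoder content queries
    anchor_boxes = box_head(content_queries).sigmoid()                      # initial boxes as positional priors
    init_masks = torch.einsum("qc,chw->qhw", content_queries, pixel_embed)  # initial mask logits
    return content_queries, anchor_boxes, init_masks
```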
Unified denoising for mask Query denoising in object detection has been shown effective [18, 37] at accelerating convergence and improving performance. It adds noise to ground-truth boxes and labels and feeds them to the Transformer decoder as noised positional and content queries; the model is trained to reconstruct the ground-truth objects from their noised versions. Since a box can be viewed as a coarse, noised version of a mask, Mask DINO extends this idea to segmentation: the model is trained to predict masks given boxes as a denoising task.
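A simplified sketch of the box-noising step used to build denoising queries (the noise scale is a placeholder); mask denoising then asks the decoder to reconstruct the clean mask from such a noised box:

```python
import torch

def make_noised_boxes(gt_boxes: torch.Tensor, box_noise_scale: float = 0.4) -> torch.Tensor:
    """Jitter ground-truth boxes to build denoising queries.

    gt_boxes: (num_gt, 4) normalized (cx, cy, w, h). The decoder is trained to
    reconstruct the clean box (and, for mask denoising, the clean mask) from
    the noised version.
    """
    cx, cy, w, h = gt_boxes.unbind(-1)
    cx = cx + (torch.rand_like(cx) * 2 - 1) * w * box_noise_scale / 2   # shift center within the box
    cy = cy + (torch.rand_like(cy) * 2 - 1) * h * box_noise_scale / 2
    w = w * (1 + (torch.rand_like(w) * 2 - 1) * box_noise_scale)        # rescale width and height
    h = h * (1 + (torch.rand_like(h) * 2 - 1) * box_noise_scale)
    return torch.stack((cx, cy, w, h), dim=-1).clamp(0, 1)
```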
Hybrid matching The two heads can predict a pair of boxes and masks that are inconsistent with each other. To address this issue, in addition to the original box and classification loss in bipartite matching, we add a mask prediction loss to encourage more accurate and consistent matching results for one query.
$$
\lambda_{\text{cls}} \mathcal{L}_{\text{cls}} + \lambda_{\text{box}} \mathcal{L}_{\text{box}} + \lambda_{\text{mask}} \mathcal{L}_{\text{mask}}
$$
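where $\lambda_{\text{cls}}$, $\lambda_{\text{box}}$, $\lambda_{\text{mask}}$ weight the classification, box, and mask terms of the matching cost. A minimal sketch of the resulting bipartite matching (cost matrices and weights are placeholders, not the paper's exact values):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hybrid_match(cls_cost, box_cost, mask_cost, w_cls=1.0, w_box=1.0, w_mask=1.0):
    """Hungarian matching over a combined cost.

    Each *_cost is a (num_queries, num_gt) pairwise cost matrix; adding the mask
    term encourages the matched query to be consistent for both the box and mask heads.
    """
    cost = w_cls * cls_cost + w_box * box_cost + w_mask * mask_cost
    q_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(q_idx), torch.as_tensor(gt_idx)
```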
https://arxiv.org/abs/2206.02777