In this paper, we have presented Mask DINO as a unified Transformer-based framework for both object detection and image segmentation. Mask DINO shows that detection and segmentation can help each other in query-based models.
DETR adopts a set-prediction objective and eliminates hand-crafted modules such as anchor design and non-maximum suppression.
Although DETR addresses both the object detection and panoptic segmentation tasks, its segmentation performance is still inferior to classical segmentation models.
In Transformer-based models, the best-performing detection and segmentation models are still not unified, which prevents task and data cooperation between detection and segmentation tasks.
It naturally leads to two questions: 1) why cannot detection and segmentation tasks help each other in Transformer-based models? and 2) is it possible to develop a unified architecture to replace specialized ones?
First, we propose a unified and enhanced query selection. Second, we propose a unified denoising training for masks to accelerate segmentation training. Third, we use a hybrid bipartite matching for more accurate and consistent matching from ground truth to both boxes and masks.
To summarize, our contributions are three-fold. 1) We develop a unified Transformer-based framework for both object detection and segmentation. 2) We demonstrate that detection and segmentation can help each other through a shared architecture design and training method. 3) We also show that, via a unified framework, segmentation can benefit from detection pre-training on a large-scale detection dataset.
Mask DINO adds another branch for mask prediction and minimally extends several key components in detection to fit segmentation tasks.
Why cannot Mask2Former do detection well? First, its queries follow the design in DETR [1] without being able to utilize better positional priors as studied in Conditional DETR [26], Anchor DETR [34], and DAB-DETR [22].
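For reference, a minimal sketch of the kind of positional prior those works use: each query carries an explicit anchor box that is turned into a sinusoidal positional embedding, rather than a content-free learned embedding as in DETR. Function name and dimensions here are illustrative, not the papers' exact code.

```python
import math
import torch

def box_to_pos_embed(anchors: torch.Tensor, dim: int = 256, temperature: float = 10000.0) -> torch.Tensor:
    """Turn normalized anchor boxes (num_queries, 4) into sinusoidal positional queries.

    Each of the 4 box coordinates gets dim // 4 sine/cosine channels, so the
    positional query carries an explicit spatial prior.
    """
    num_feats = dim // 4
    dim_t = torch.arange(num_feats, dtype=torch.float32, device=anchors.device)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_feats)

    pos = anchors * 2 * math.pi                     # (num_queries, 4)
    pos = pos[..., None] / dim_t                    # (num_queries, 4, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(-3)                          # (num_queries, dim)
```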
Second, Mask2Former adopts masked attention (multi-head attention with an attention mask) in its Transformer decoders. The attention masks predicted from a previous layer are of high resolution and used as hard constraints for attention computation.
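A rough single-head sketch of this kind of masked attention (illustrative, not Mask2Former's actual code): the mask predicted by the previous layer is thresholded and used to block attention outside the foreground region.

```python
import torch

def masked_cross_attention(q, k, v, prev_mask_logits, threshold: float = 0.5):
    """q: (num_queries, dim) decoder queries; k, v: (num_pixels, dim) image features;
    prev_mask_logits: (num_queries, num_pixels) mask logits from the previous layer."""
    logits = q @ k.t() / q.shape[-1] ** 0.5                # (num_queries, num_pixels)
    keep = prev_mask_logits.sigmoid() > threshold          # hard constraint from the previous prediction
    keep = keep | ~keep.any(dim=-1, keepdim=True)          # fall back to full attention if a mask is empty
    attn = logits.masked_fill(~keep, float("-inf")).softmax(dim=-1)
    return attn @ v                                        # (num_queries, dim)
```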
Third, Mask2Former cannot explicitly perform box refinement layer by layer.
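By contrast, DETR-like detectors with explicit box queries refine reference boxes at every decoder layer. A minimal sketch of that iterative refinement, with hypothetical helper names:

```python
import torch

def inverse_sigmoid(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

def refine_boxes(reference_boxes: torch.Tensor, query_embed: torch.Tensor, box_head) -> torch.Tensor:
    """One decoder layer of box refinement.

    reference_boxes: (num_queries, 4) normalized (cx, cy, w, h) from the previous layer;
    query_embed: (num_queries, dim) updated query features; box_head predicts a 4-dim delta.
    """
    delta = box_head(query_embed)                                    # predicted offset in logit space
    refined = (inverse_sigmoid(reference_boxes) + delta).sigmoid()   # apply offset, map back to [0, 1]
    return refined.detach()  # detached boxes serve as the next layer's positional prior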
Segmentation branch To perform segmentation, Mask DINO obtains a mask by dot-producting each content query embedding with a high-resolution pixel embedding map fused from backbone and Transformer encoder features.
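A minimal sketch of this dot-product mask head (tensor names and shapes are illustrative):

```python
import torch

def predict_masks(query_embed: torch.Tensor, pixel_embed: torch.Tensor) -> torch.Tensor:
    """query_embed: (num_queries, dim) content query embeddings from the decoder;
    pixel_embed: (dim, H, W) per-pixel embedding map fused from backbone and encoder features.
    Returns mask logits of shape (num_queries, H, W)."""
    return torch.einsum("qc,chw->qhw", query_embed, pixel_embed)
```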
Unified query selection for mask Query selection has been widely used in traditional two-stage models [28] and many DETR-like models [37, 40] to improve detection performance.
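In DETR-like query selection, top-ranked encoder features initialize the decoder queries; the unified version additionally predicts an initial mask from the selected features. A rough sketch under assumed tensor shapes and head names:

```python
import torch

def unified_query_selection(enc_tokens, cls_head, box_head, pixel_embed, num_queries: int = 300):
    """Select top-scoring encoder tokens to initialize content queries, anchor boxes,
    and initial mask predictions (heads and shapes are illustrative).

    enc_tokens: (num_tokens, dim) Transformer encoder output features;
    pixel_embed: (dim, H, W) per-pixel embedding map.
    """
    scores = cls_head(enc_tokens).max(dim=-1).values                        # (num_tokens,) best class score
    topk = scores.topk(num_queries).indices
    content_queries = enc_tokens[topk]                                      # initialize decoder content queries
    anchor_boxes = box_head(content_queries).sigmoid()                      # initial boxes as positional priors
    init_masks = torch.einsum("qc,chw->qhw", content_queries, pixel_embed)  # initial mask logits
    return content_queries, anchor_boxes, init_masks
```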
Unified denoising for mask Query denoising in object detection has been shown effective [18, 37] at accelerating convergence and improving performance. It adds noise to ground-truth boxes and labels and feeds them to the Transformer decoder as noised positional and content queries; the model is trained to reconstruct the ground-truth objects from their noised versions. Since a box can be viewed as a coarse, noised version of a mask, Mask DINO extends this idea to segmentation: the model is trained to predict masks given boxes as a denoising task.
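A simplified sketch of the box-noising step used to build denoising queries (the noise scale is a placeholder); mask denoising then asks the decoder to reconstruct the clean mask from such a noised box:

```python
import torch

def make_noised_boxes(gt_boxes: torch.Tensor, box_noise_scale: float = 0.4) -> torch.Tensor:
    """Jitter ground-truth boxes to build denoising queries.

    gt_boxes: (num_gt, 4) normalized (cx, cy, w, h). The decoder is trained to
    reconstruct the clean box (and, for mask denoising, the clean mask) from
    the noised version.
    """
    cx, cy, w, h = gt_boxes.unbind(-1)
    cx = cx + (torch.rand_like(cx) * 2 - 1) * w * box_noise_scale / 2   # shift center within the box
    cy = cy + (torch.rand_like(cy) * 2 - 1) * h * box_noise_scale / 2
    w = w * (1 + (torch.rand_like(w) * 2 - 1) * box_noise_scale)        # rescale width and height
    h = h * (1 + (torch.rand_like(h) * 2 - 1) * box_noise_scale)
    return torch.stack((cx, cy, w, h), dim=-1).clamp(0, 1)
```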
Hybrid matching The two heads can predict a pair of boxes and masks that are inconsistent with each other. To address this issue, in addition to the original box and classification loss in bipartite matching, we add a mask prediction loss to encourage more accurate and consistent matching results for one query.
$$
\lambda_{\text{cls}} \mathcal{L}_{\text{cls}} + \lambda_{\text{box}} \mathcal{L}_{\text{box}} + \lambda_{\text{mask}} \mathcal{L}_{\text{mask}}
$$
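where $\lambda_{\text{cls}}$, $\lambda_{\text{box}}$, $\lambda_{\text{mask}}$ weight the classification, box, and mask terms of the matching cost. A minimal sketch of the resulting bipartite matching (cost matrices and weights are placeholders, not the paper's exact values):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hybrid_match(cls_cost, box_cost, mask_cost, w_cls=1.0, w_box=1.0, w_mask=1.0):
    """Hungarian matching over a combined cost.

    Each *_cost is a (num_queries, num_gt) pairwise cost matrix; adding the mask
    term encourages the matched query to be consistent for both the box and mask heads.
    """
    cost = w_cls * cls_cost + w_box * box_cost + w_mask * mask_cost
    q_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(q_idx), torch.as_tensor(gt_idx)
```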
https://arxiv.org/abs/2206.02777