5g4s / paper

DINO: DETR WITH IMPROVED DENOISING ANCHOR BOXES FOR END-TO-END OBJECT DETECTION #33

Open 5g4s opened 1 year ago

5g4s commented 1 year ago

https://arxiv.org/abs/2203.03605

5g4s commented 1 year ago

In contrast to classical detection algorithms, DETR [3] is a novel Transformer-based detection algorithm. It eliminates the need for hand-designed components and achieves performance comparable to optimized classical detectors like Faster RCNN [31].

5g4s commented 1 year ago

Despite its promising performance, the training convergence of DETR is slow and the meaning of queries is unclear. To address such problems, many methods have been proposed.

5g4s commented 1 year ago

Problem

1) Previous DETR-like models are inferior to the improved classical detectors.
2) The scalability of DETR-like models has not been well studied: there are no reported results on how they perform when scaled to a large backbone and a large-scale dataset.

5g4s commented 1 year ago

We add noised ground-truth labels and boxes to the Transformer decoder layers to help stabilize bipartite matching during training.
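As a rough illustration of what noised ground truth could look like, the sketch below flips each label with some probability and shifts each box centre by a bounded random offset. The function name `noise_gt`, the parameters `flip_prob` and `box_scale`, and the exact form of the noise are assumptions made for illustration, not the paper's implementation.

```python
import random


def noise_gt(labels, boxes, num_classes, flip_prob, box_scale, rng):
    """Return noised copies of GT labels and (cx, cy, w, h) boxes.

    Hypothetical helper: the noise model here is an assumption.
    """
    # Label noise: with probability flip_prob, replace the label by a random class.
    noised_labels = [
        rng.randrange(num_classes) if rng.random() < flip_prob else label
        for label in labels
    ]
    noised_boxes = []
    for cx, cy, w, h in boxes:
        # Box noise: shift the centre by at most box_scale * half the box size.
        dx = rng.uniform(-1, 1) * box_scale * w / 2
        dy = rng.uniform(-1, 1) * box_scale * h / 2
        noised_boxes.append((cx + dx, cy + dy, w, h))
    return noised_labels, noised_boxes


rng = random.Random(0)
labels = [3, 7]
boxes = [(0.5, 0.5, 0.2, 0.4), (0.3, 0.6, 0.1, 0.1)]
noised_labels, noised_boxes = noise_gt(labels, boxes, 10, 0.2, 0.4, rng)
# The decoder would receive the noised pair and be trained to recover (labels, boxes).
```

Because each noised query is tied to a known ground-truth target, these queries need no bipartite matching.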

5g4s commented 1 year ago

We also adopt deformable attention [41] for its computational efficiency.

5g4s commented 1 year ago

We propose three new methods as follows.

First, to improve the one-to-one matching, we propose contrastive denoising training, which adds both positive and negative samples of the same ground truth at the same time. After adding two different amounts of noise to the same ground-truth box, we mark the box with the smaller noise as positive and the other as negative. Contrastive denoising training helps the model avoid duplicate outputs for the same target.

Second, the dynamic anchor box formulation of queries links DETR-like models with classical two-stage models. Hence we propose a mixed query selection method, which helps better initialize the queries. We select initial anchor boxes as positional queries from the output of the encoder.

Third, to leverage the refined box information from later layers to help optimize the parameters of their adjacent early layers, we propose a new "look forward twice" scheme that corrects the updated parameters with gradients from later layers.
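One way to write the third scheme down is sketched below. The notation does not appear in these notes and should be read as an assumption: $b_{i-1}$ is the detached anchor box entering decoder layer $i$, $b'_{i-1}$ is its undetached version, $\Delta b_i$ is the offset predicted by layer $i$, and $\mathrm{Update}$ applies an offset to a box.

```latex
\begin{aligned}
\Delta b_i &= \mathrm{Layer}_i(b_{i-1}), & b'_i &= \mathrm{Update}(b_{i-1}, \Delta b_i), \\
b_i &= \mathrm{Detach}(b'_i), & b_i^{(\mathrm{pred})} &= \mathrm{Update}(b'_{i-1}, \Delta b_i).
\end{aligned}
```

Under this reading, the loss on $b_i^{(\mathrm{pred})}$ reaches layer $i-1$ through the undetached $b'_{i-1}$, so each layer's parameters receive gradients from its own prediction and from the next layer's.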

5g4s commented 1 year ago

The reference point concept makes it possible to develop several techniques to further improve the DETR performance. The first technique is query selection, which selects features and reference boxes from the encoder as inputs to the decoder directly. The second technique is iterative bounding box refinement with a careful gradient detachment design between two decoder layers.

5g4s commented 1 year ago

Contrastive DeNoising Training

DN-DETR is very effective in stabilizing training and accelerating convergence.

With the help of DN queries, the model learns to make predictions based on anchors that have GT boxes nearby.

Problem

However, it lacks the ability to predict “no object” for anchors with no object nearby.

Purpose

To address this issue, we propose a Contrastive DeNoising (CDN) approach for rejecting useless anchors.

We generate two types of CDN queries: positive queries and negative queries. Positive queries within the inner square have a noise scale smaller than λ1 and are expected to reconstruct their corresponding ground truth boxes. Negative queries between the inner and outer squares have a noise scale larger than λ1 and smaller than λ2. They are expected to predict “no object”.
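The two noise bands can be sketched as follows. This is a minimal illustration that perturbs only the box centre; the function name `make_cdn_queries`, the uniform sampling, and the choice to measure noise as a fraction of half the box size are all assumptions, not the paper's implementation.

```python
import random


def make_cdn_queries(gt_box, lambda1, lambda2, rng):
    """Return (positive, negative) noised copies of one GT box (cx, cy, w, h).

    Hypothetical helper: only the centre is perturbed in this sketch.
    """
    cx, cy, w, h = gt_box

    def jitter(lo, hi):
        # Draw a noise scale in [lo, hi] per axis, pick a random sign,
        # and shift the centre by that fraction of half the box size.
        dx = rng.uniform(lo, hi) * rng.choice((-1, 1)) * w / 2
        dy = rng.uniform(lo, hi) * rng.choice((-1, 1)) * h / 2
        return (cx + dx, cy + dy, w, h)

    positive = jitter(0.0, lambda1)      # small noise: target is the GT box
    negative = jitter(lambda1, lambda2)  # larger noise: target is "no object"
    return positive, negative


rng = random.Random(0)
pos, neg = make_cdn_queries((0.5, 0.5, 0.2, 0.4), lambda1=0.5, lambda2=1.0, rng=rng)
```

Both queries come from the same ground-truth box, so the only signal separating them is how far each one has drifted from it.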

5g4s commented 1 year ago

The confusion happens when multiple anchors are close to one object. In this case, it is hard for the model to decide which anchor to choose.

The confusion may lead to two problems. The first is duplicate predictions. With CDN queries, our model can distinguish the slight difference between anchors and avoid duplicate predictions. The second problem is that an unwanted anchor farther from a GT box might be selected. CDN improves the model's ability to choose nearby anchors by teaching it to reject farther ones.

5g4s commented 1 year ago

We initialize only the anchor boxes, using the position information associated with the selected top-K features, and leave the content queries static as before. In other words, our mixed query selection approach enhances only the positional queries with the top-K selected features and keeps the content queries learnable.
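A minimal sketch of this selection, using plain Python lists in place of tensors. The function name `mixed_query_selection` and its arguments are hypothetical, and ranking encoder tokens by a single per-token score is an assumption about how "top-K" is decided.

```python
def mixed_query_selection(enc_scores, enc_boxes, learned_content, k):
    """Pick the top-k encoder tokens and pair their boxes with learned content queries.

    Hypothetical helper: inputs are plain lists, one entry per encoder token.
    """
    # Rank encoder tokens by score, highest first, and keep the first k indices.
    top = sorted(range(len(enc_scores)), key=lambda i: enc_scores[i], reverse=True)[:k]
    # Positional queries: anchor boxes taken from the selected encoder tokens.
    anchors = [enc_boxes[i] for i in top]
    # Content queries: learned embeddings, not taken from the encoder output.
    content = learned_content[:k]
    return anchors, content


scores = [0.1, 0.9, 0.4, 0.7]
boxes = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3), (0.8, 0.2, 0.1, 0.1), (0.4, 0.6, 0.2, 0.5)]
learned = ["q0", "q1", "q2"]
anchors, content = mixed_query_selection(scores, boxes, learned, k=2)
```

Only `anchors` depends on the image. `content` is the same learned set for every input, which is what "static" refers to here.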