
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR #39

Open 5g4s opened 1 year ago

5g4s commented 1 year ago

https://arxiv.org/abs/2201.12329

5g4s commented 1 year ago

This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer.
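As a rough PyTorch sketch of this formulation (not the authors' code): the queries are learnable 4D anchors (cx, cy, w, h) that are turned into a positional query at every decoder layer and refined by predicted offsets. The names `AnchorBoxDecoder`, `box_to_query`, and `refine_head` are illustrative, and the linear box-to-query projection is a stand-in for the paper's positional encoding of the coordinates.

```python
import torch
import torch.nn as nn


def inverse_sigmoid(x, eps=1e-5):
    """Map box coordinates from (0, 1) back to logit space for stable updates."""
    x = x.clamp(min=eps, max=1 - eps)
    return torch.log(x / (1 - x))


class AnchorBoxDecoder(nn.Module):
    """Decoder whose queries are 4D anchor boxes refined layer-by-layer."""

    def __init__(self, d_model=256, num_layers=6, num_queries=300):
        super().__init__()
        # Learnable anchors (cx, cy, w, h) shared across images, stored in logit space.
        self.anchors = nn.Parameter(torch.randn(num_queries, 4))
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
             for _ in range(num_layers)])
        self.box_to_query = nn.Linear(4, d_model)   # stand-in for sinusoidal PE + MLP
        self.refine_head = nn.Linear(d_model, 4)    # predicts box offsets per layer

    def forward(self, memory):                      # memory: (B, HW, d_model)
        B = memory.size(0)
        boxes = self.anchors.sigmoid().unsqueeze(0).expand(B, -1, -1)
        content = torch.zeros(B, boxes.size(1), memory.size(-1),
                              device=memory.device)
        all_boxes = []
        for layer in self.layers:
            # The positional part of the query is derived from the current boxes.
            content = layer(content + self.box_to_query(boxes), memory)
            # Dynamic update: apply predicted offsets in logit space, back to (0, 1).
            boxes = (inverse_sigmoid(boxes) + self.refine_head(content)).sigmoid()
            all_boxes.append(boxes)
        return all_boxes, content
```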

5g4s commented 1 year ago

However, due to its ineffective design and use of queries, DETR suffers from significantly slow training convergence, usually requiring 500 epochs to achieve good performance.

5g4s commented 1 year ago

image

5g4s commented 1 year ago

Two possible reasons in the cross-attention module could account for the model's slow training convergence:

1) It is hard to learn the queries due to the optimization challenge. -> To test this, we reuse the well-learned queries from a converged DETR (keeping them fixed) and only train the other modules. The training curves in Fig. 3(a) show that the fixed queries only slightly improve convergence in the very early epochs, e.g. the first 25 epochs. Hence query learning (or optimization) is likely not the key concern. image

2) The positional information in the learned queries is not encoded in the same way as the sinusoidal positional encoding used for image features. -> Each query can be regarded as a positional prior that lets the decoder focus on a region of interest. Although the queries serve as a positional constraint, they also carry undesirable properties: multiple modes and nearly uniform attention weights. We conjecture that this multi-mode property of DETR's queries is likely the root cause of its slow training, and that introducing explicit positional priors to constrain queries to a local region is desirable for training (see the encoding sketch after the figure below).

image
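A minimal sketch of what point 2) asks for: encode each anchor coordinate with the same sinusoidal scheme used for the image features, then project to the model dimension. The per-coordinate embedding size, the small MLP, and the names `sine_embed` / `BoxPositionalQuery` are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn


def sine_embed(coord, num_feats=128, temperature=10000):
    """Sinusoidal embedding of a scalar coordinate in [0, 1] -> (..., num_feats)."""
    dim_t = torch.arange(num_feats, dtype=torch.float32, device=coord.device)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos = coord.unsqueeze(-1) * 2 * math.pi / dim_t
    return torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()),
                       dim=-1).flatten(-2)


class BoxPositionalQuery(nn.Module):
    """Maps (cx, cy, w, h) anchors to positional queries via sinusoidal PE + MLP."""

    def __init__(self, d_model=256, num_feats=128):
        super().__init__()
        self.num_feats = num_feats
        self.mlp = nn.Sequential(nn.Linear(4 * num_feats, d_model),
                                 nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, boxes):                       # boxes: (B, Q, 4) in [0, 1]
        embeds = [sine_embed(boxes[..., i], self.num_feats) for i in range(4)]
        return self.mlp(torch.cat(embeds, dim=-1))  # (B, Q, d_model)
```

Applying the same `sine_embed` to the (x, y) grid of the image features puts positional queries and positional keys in one embedding space, which is exactly the property point 2) says DETR's learned queries lack.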

5g4s commented 1 year ago

The two attention maps at the top of Fig. 4(a) have two or more concentration centers, making it hard to locate objects when multiple objects exist in an image. The bottom maps of Fig. 4(a) focus on areas that are either too large or too small, and hence cannot inject useful positional information into the procedure of feature extraction.

Conditional DETR (Meng et al., 2021) uses explicit positional embeddings as positional queries for training, yielding attention maps similar to Gaussian kernels, as shown in Fig. 4(b). Although explicit positional priors lead to good performance in training, they ignore the scale information of objects. In contrast, our proposed DAB-DETR explicitly takes the object scale information into account to adaptively adjust the attention weights, as shown in Fig. 4(c).

image
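One way the scale-awareness in Fig. 4(c) could look in code, as a hedged sketch: split the positional embedding into an x-half and a y-half and rescale them by the anchor's width and height, relative to a reference size predicted from the content query. The half-and-half split, the reference-size head, and all names here are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class ModulatedPositionalAttention(nn.Module):
    """Positional attention logits rescaled by the anchor's width and height."""

    def __init__(self, d_model=256):
        super().__init__()
        self.d_model = d_model
        # Predicts a reference width/height from the content query.
        self.ref_wh = nn.Sequential(nn.Linear(d_model, 2), nn.Sigmoid())

    def forward(self, pos_q, pos_k, content_q, boxes):
        # pos_q:     (B, Q,  d) positional queries, laid out as [x-half | y-half]
        # pos_k:     (B, HW, d) positional keys of the image features
        # content_q: (B, Q,  d) content queries from the decoder
        # boxes:     (B, Q,  4) current anchors (cx, cy, w, h) in [0, 1]
        half = self.d_model // 2
        w_ref, h_ref = self.ref_wh(content_q).unbind(-1)
        w = boxes[..., 2].clamp(min=1e-3)
        h = boxes[..., 3].clamp(min=1e-3)
        # Wide boxes spread attention along x, tall boxes along y.
        scale = torch.cat([(w_ref / w).unsqueeze(-1).expand(-1, -1, half),
                           (h_ref / h).unsqueeze(-1).expand(-1, -1, half)],
                          dim=-1)
        attn = torch.einsum('bqd,bkd->bqk', pos_q * scale, pos_k)
        return attn / self.d_model ** 0.5           # (B, Q, HW) attention logits
```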

5g4s commented 1 year ago

Approach

We replace the query formulation in DETR with dynamic anchor boxes, which can enforce each query to focus on a specific area, and name this model DETR+DAB.
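For illustration only, the hypothetical AnchorBoxDecoder sketched in an earlier comment could be exercised end-to-end on dummy features like this:

```python
import torch

# Dummy flattened image features: batch of 2, a 25x25 feature map, d_model = 256.
memory = torch.randn(2, 625, 256)
decoder = AnchorBoxDecoder(d_model=256, num_layers=6, num_queries=300)
all_boxes, content = decoder(memory)
# One refined set of normalized (cx, cy, w, h) anchors per decoder layer;
# the last one would feed the classification and box heads.
print(len(all_boxes), all_boxes[-1].shape)   # 6 torch.Size([2, 300, 4])
```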

Result

The training curves in Fig. 3(b) show that DETR+DAB achieves much better performance than DETR, in terms of both detection AP and training/testing loss. image