Open 5g4s opened 1 year ago
We show that the slow convergence results from the instability of bipartite graph matching, which causes inconsistent optimization goals in the early training stages.
To address this issue, in addition to the Hungarian loss, our method feeds GT bounding boxes with added noise into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the difficulty of bipartite graph matching and leads to faster convergence.
Much prior work has tried to identify the root cause of the slow convergence and mitigate it. Some approaches address the problem by improving the model architecture.
For the same image, a query is often matched with different objects in different epochs, which makes optimization ambiguous and inconsistent.
To address this problem, we propose a novel training method by introducing a query denoising task to help stabilize bipartite graph matching in the training process. For noised queries, we perform a denoising task to reconstruct their corresponding GT boxes. Our loss function consists of two components. One is a reconstruction loss and the other is a Hungarian loss which is the same as in other DETR-like methods.
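A minimal sketch of the two-part loss described above, assuming PyTorch. The denoising (reconstruction) term needs no matching, since each noised query is paired with the GT object it was generated from; the Hungarian loss is passed in as a callable because its exact form follows other DETR-like methods. Function names and the `lam` weight are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred_boxes, gt_boxes, pred_logits, gt_labels):
    # Denoising term: each noised query is assigned to the GT object it
    # was generated from, so no bipartite matching is needed here.
    box_loss = F.l1_loss(pred_boxes, gt_boxes)          # box regression
    cls_loss = F.cross_entropy(pred_logits, gt_labels)  # label prediction
    return box_loss + cls_loss

def total_loss(matching_out, targets, dn_out, dn_targets,
               hungarian_loss, lam=1.0):
    # Hungarian loss on the ordinary matching queries, plus the
    # reconstruction loss on the denoising queries (lam is a
    # hypothetical weighting factor).
    return hungarian_loss(matching_out, targets) + lam * reconstruction_loss(
        dn_out["boxes"], dn_targets["boxes"],
        dn_out["logits"], dn_targets["labels"])
```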
To evaluate instability, we define an instability score (IS) that measures how often a query's matched GT object changes across adjacent training epochs. The larger the IS, the more unstable the matching.
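One way to compute such a score, sketched here as an assumption rather than the paper's exact formula: record which query each GT object is matched to at two adjacent epochs and count the objects whose assignment changed (with `-1` marking an unmatched object).

```python
def instability_score(match_prev, match_curr):
    """Count GT objects whose matched query changed between two epochs.

    match_prev / match_curr: for one image, a list mapping each GT object
    index to the query it was matched to at two adjacent epochs
    (-1 = unmatched). A switch to/from -1 also counts as a change.
    """
    return sum(1 for p, c in zip(match_prev, match_curr) if p != c)
```

Averaging this count over all images gives a per-epoch instability curve.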
We consider adding noise to boxes in two ways: center shifting and box scaling. For label noising, we adopt label flipping, which means we randomly flip some GT labels to other labels.
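The two box-noising schemes and label flipping can be sketched as follows. Boxes are assumed to be in `(cx, cy, w, h)` format, and the `shift_ratio`, `scale_ratio`, and `flip_prob` values are illustrative hyperparameters, not the paper's settings.

```python
import random

def noise_box(box, shift_ratio=0.4, scale_ratio=0.4):
    # box = (cx, cy, w, h), normalized coordinates.
    cx, cy, w, h = box
    # Center shifting: move the center by at most shift_ratio of the
    # half-width / half-height, so it stays inside the original box.
    cx += random.uniform(-1, 1) * shift_ratio * w / 2
    cy += random.uniform(-1, 1) * shift_ratio * h / 2
    # Box scaling: rescale w and h within [1 - scale_ratio, 1 + scale_ratio].
    w *= 1 + random.uniform(-1, 1) * scale_ratio
    h *= 1 + random.uniform(-1, 1) * scale_ratio
    return (cx, cy, w, h)

def flip_label(label, num_classes, flip_prob=0.2):
    # Label flipping: with probability flip_prob, replace the GT label
    # with a randomly drawn class.
    if random.random() < flip_prob:
        return random.randrange(num_classes)
    return label
```

The model is then trained to recover the original box and label from these noised versions.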
The purpose of the attention mask is to prevent information leakage, of which there are two potential types. One is that the matching part may see the noised GT objects and trivially predict the GT objects. The other is that one noised version of a GT object may see another version. Therefore, our attention mask ensures that the matching part cannot see the denoising part and that the denoising groups cannot see each other.
In the figure, the yellow, brown, and green grids in the attention mask represent 0 (unblocked) and the grey grids represent 1 (blocked).
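The two blocking rules above can be sketched as mask construction code, assuming queries are ordered as `[denoising group 0 | group 1 | ... | matching queries]` and the boolean convention of PyTorch's `attn_mask` (`True` = blocked). The exact query ordering and group layout are assumptions of this sketch.

```python
import torch

def make_attention_mask(num_groups, group_size, num_matching):
    # mask[i, j] = True blocks query i from attending to query j.
    n_dn = num_groups * group_size
    total = n_dn + num_matching
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Rule 1: denoising groups cannot see each other
    # (but a group can see itself and the matching part).
    for g in range(num_groups):
        s, e = g * group_size, (g + 1) * group_size
        mask[s:e, :s] = True        # block earlier groups
        mask[s:e, e:n_dn] = True    # block later groups
    # Rule 2: matching queries cannot see any denoising query.
    mask[n_dn:, :n_dn] = True
    return mask
```

The resulting tensor can be passed as `attn_mask` to the decoder's self-attention.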
https://arxiv.org/abs/2203.01305