Open 5g4s opened 1 year ago
We show that the slow convergence results from the instability of bipartite graph matching, which causes inconsistent optimization goals in the early training stages.
To address this issue, in addition to the Hungarian loss, our method feeds GT bounding boxes with added noise into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the difficulty of bipartite graph matching and leads to faster convergence.
Much prior work has tried to identify the root cause of the slow convergence and mitigate it. Some approaches address the problem by improving the model architecture.
For the same image, a query is often matched with different objects in different epochs, which makes optimization ambiguous and inconsistent.
To address this problem, we propose a novel training method by introducing a query denoising task to help stabilize bipartite graph matching in the training process. For noised queries, we perform a denoising task to reconstruct their corresponding GT boxes. Our loss function consists of two components. One is a reconstruction loss and the other is a Hungarian loss which is the same as in other DETR-like methods.
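A minimal sketch of the two-part loss described above, assuming PyTorch. The denoising (reconstruction) term needs no matching, since each noised query is paired with the GT object it was generated from; the Hungarian loss is passed in as a callable because its exact form follows other DETR-like methods. Function names and the `lam` weight are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred_boxes, gt_boxes, pred_logits, gt_labels):
    # Denoising term: each noised query is assigned to the GT object it
    # was generated from, so no bipartite matching is needed here.
    box_loss = F.l1_loss(pred_boxes, gt_boxes)          # box regression
    cls_loss = F.cross_entropy(pred_logits, gt_labels)  # label prediction
    return box_loss + cls_loss

def total_loss(matching_out, targets, dn_out, dn_targets,
               hungarian_loss, lam=1.0):
    # Hungarian loss on the ordinary matching queries, plus the
    # reconstruction loss on the denoising queries (lam is a
    # hypothetical weighting factor).
    return hungarian_loss(matching_out, targets) + lam * reconstruction_loss(
        dn_out["boxes"], dn_targets["boxes"],
        dn_out["logits"], dn_targets["labels"])
```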
To evaluate instability, we define an instability score (IS) that measures how often a query's matched GT object changes across adjacent training epochs. The larger the IS, the more unstable the matching.
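One way to compute such a score, sketched here as an assumption rather than the paper's exact formula: record which query each GT object is matched to at two adjacent epochs and count the objects whose assignment changed (with `-1` marking an unmatched object).

```python
def instability_score(match_prev, match_curr):
    """Count GT objects whose matched query changed between two epochs.

    match_prev / match_curr: for one image, a list mapping each GT object
    index to the query it was matched to at two adjacent epochs
    (-1 = unmatched). A switch to/from -1 also counts as a change.
    """
    return sum(1 for p, c in zip(match_prev, match_curr) if p != c)
```

Averaging this count over all images gives a per-epoch instability curve.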
We consider adding noise to boxes in two ways: center shifting and box scaling. For label noising, we adopt label flipping, which means we randomly flip some GT labels to other labels.
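The two box-noising schemes and label flipping can be sketched as follows. Boxes are assumed to be in `(cx, cy, w, h)` format, and the `shift_ratio`, `scale_ratio`, and `flip_prob` values are illustrative hyperparameters, not the paper's settings.

```python
import random

def noise_box(box, shift_ratio=0.4, scale_ratio=0.4):
    # box = (cx, cy, w, h), normalized coordinates.
    cx, cy, w, h = box
    # Center shifting: move the center by at most shift_ratio of the
    # half-width / half-height, so it stays inside the original box.
    cx += random.uniform(-1, 1) * shift_ratio * w / 2
    cy += random.uniform(-1, 1) * shift_ratio * h / 2
    # Box scaling: rescale w and h within [1 - scale_ratio, 1 + scale_ratio].
    w *= 1 + random.uniform(-1, 1) * scale_ratio
    h *= 1 + random.uniform(-1, 1) * scale_ratio
    return (cx, cy, w, h)

def flip_label(label, num_classes, flip_prob=0.2):
    # Label flipping: with probability flip_prob, replace the GT label
    # with a randomly drawn class.
    if random.random() < flip_prob:
        return random.randrange(num_classes)
    return label
```

The model is then trained to recover the original box and label from these noised versions.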
The purpose of the attention mask is to prevent information leakage, of which there are two potential types. One is that the matching part may see the noised GT objects and trivially predict the GT objects. The other is that one noised version of a GT object may see another version. Therefore, our attention mask ensures that the matching part cannot see the denoising part and that the denoising groups cannot see each other.
In the figure, the yellow, brown, and green grids in the attention mask represent 0 (unblocked) and the grey grids represent 1 (blocked).
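The two blocking rules above can be sketched as mask construction code, assuming queries are ordered as `[denoising group 0 | group 1 | ... | matching queries]` and the boolean convention of PyTorch's `attn_mask` (`True` = blocked). The exact query ordering and group layout are assumptions of this sketch.

```python
import torch

def make_attention_mask(num_groups, group_size, num_matching):
    # mask[i, j] = True blocks query i from attending to query j.
    n_dn = num_groups * group_size
    total = n_dn + num_matching
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Rule 1: denoising groups cannot see each other
    # (but a group can see itself and the matching part).
    for g in range(num_groups):
        s, e = g * group_size, (g + 1) * group_size
        mask[s:e, :s] = True        # block earlier groups
        mask[s:e, e:n_dn] = True    # block later groups
    # Rule 2: matching queries cannot see any denoising query.
    mask[n_dn:, :n_dn] = True
    return mask
```

The resulting tensor can be passed as `attn_mask` to the decoder's self-attention.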
https://arxiv.org/abs/2203.01305