Atten4Vis / ConditionalDETR

This repository is an official implementation of the ICCV 2021 paper "Conditional DETR for Fast Training Convergence". (https://arxiv.org/abs/2108.06152)
Apache License 2.0

Default PostProcess behavior is changed compared with DETR #8

Closed YuyaoXiaoCS closed 2 years ago

YuyaoXiaoCS commented 2 years ago

Hi Author,

In your code:

    prob = out_logits.sigmoid()
    topk_values, topk_indexes = torch.topk(prob.view(out_logits.shape[0], -1), 100, dim=1)
    scores = topk_values
    topk_boxes = topk_indexes // out_logits.shape[2]
    labels = topk_indexes % out_logits.shape[2]

In DETR:

    prob = F.softmax(out_logits, -1)
    scores, labels = prob[..., :-1].max(-1)

This behavior seems a little strange: you select the top 100 scores across all 300 object queries and their classes, which means that one object query can be selected twice with different classes. I would really appreciate any explanation!
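To make the concern concrete, here is a minimal pure-Python sketch of the flattened top-k decoding quoted above. It is not the repository's code: the helper name `topk_query_class_pairs` and the toy logits are hypothetical, and plain lists stand in for tensors. It shows how `index // num_classes` recovers the query and `index % num_classes` recovers the class, so one query can appear more than once.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def topk_query_class_pairs(logits, k):
    """Hypothetical helper: top-k over all (query, class) pairs of one image."""
    num_classes = len(logits[0])
    # Flatten [num_queries][num_classes] into one list, as prob.view(batch, -1) does.
    flat = [sigmoid(v) for row in logits for v in row]
    top = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:k]
    # Integer division recovers the query index, the remainder recovers the class.
    return [(i // num_classes, i % num_classes, flat[i]) for i in top]

# Toy values (assumed for illustration): 2 queries x 3 classes.
toy_logits = [
    [2.0, 1.5, -3.0],   # query 0 scores high on classes 0 and 1
    [-3.0, -3.0, 1.0],  # query 1 scores high on class 2 only
]
picks = topk_query_class_pairs(toy_logits, k=2)
for query, label, score in picks:
    print(f"query={query} label={label} score={score:.3f}")
```

With these toy values both selected entries come from query 0, once per class, which is exactly the situation described in the question.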

Thanks!

DeppMeng commented 2 years ago

Hi,

Sorry for the late reply. It is a very interesting question. We followed Deformable DETR's implementation at that time. To address your concern, we evaluated our model with different post-processing methods:

  1. Same as ours: (1) sigmoid normalization (2) sort on 300*91 predictions: 41.0 AP
  2. Same as DETR: (1) softmax normalization (2) sort on 300 predictions (only select highest-score class for each query): 37.7 AP
  3. Something in between: (1) sigmoid normalization (2) sort on 300 predictions (only select highest-score class for each query): 39.9 AP
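The three variants can be sketched in a few lines of pure Python. This is an illustrative sketch only, not the evaluation code behind the AP numbers: `rank_predictions` and the toy logits are hypothetical, and variant 2 here applies softmax over the given class logits without the extra no-object column that DETR's own head has.

```python
import math

def sigmoid_row(row):
    return [1.0 / (1.0 + math.exp(-v)) for v in row]

def softmax_row(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rank_predictions(logits, normalize, best_class_only, k):
    """Return up to k (score, query, label) triples, highest score first."""
    candidates = []
    for q, row in enumerate(logits):
        probs = normalize(row)
        if best_class_only:
            # One candidate per query: its highest-scoring class.
            label = max(range(len(probs)), key=probs.__getitem__)
            candidates.append((probs[label], q, label))
        else:
            # One candidate per (query, class) pair.
            candidates.extend((p, q, c) for c, p in enumerate(probs))
    candidates.sort(key=lambda t: t[0], reverse=True)
    return candidates[:k]

# Toy values (assumed): query 1 has low logits everywhere, like an un-matched query.
toy_logits = [[2.0, 1.5, -3.0], [-4.0, -5.0, -4.5], [-3.0, -3.0, 1.0]]
variant_1 = rank_predictions(toy_logits, sigmoid_row, best_class_only=False, k=3)
variant_2 = rank_predictions(toy_logits, softmax_row, best_class_only=True, k=3)
variant_3 = rank_predictions(toy_logits, sigmoid_row, best_class_only=True, k=3)
for name, ranked in [("1", variant_1), ("2", variant_2), ("3", variant_3)]:
    print(name, [(q, c, round(s, 3)) for s, q, c in ranked])
```

On these toy values, variant 1 keeps two classes from query 0, while variants 2 and 3 keep one class per query and differ only in how the scores are normalized before sorting.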

In our understanding, the post-processing is tied to how un-matched predictions (i.e., non-object predictions) are handled. DETR assigns these predictions to an extra class (non-object) and applies softmax normalization over all classes, so the non-object score of an un-matched prediction is much higher than its score for any other class. Conditional DETR (like Deformable DETR) instead uses a binary sigmoid focal loss for classification: un-matched predictions are assigned an all-zero ground truth (the target score of every class is 0), so they should have very low scores on all classes.

Given the above, if Conditional DETR uses DETR's post-processing (including softmax normalization), the scores of un-matched predictions are raised, which harms the final ranking (resulting in 37.7 AP). Conversely, DETR needs softmax normalization in post-processing to suppress the scores of un-matched predictions.
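One way to see why softmax can raise these scores: softmax is unchanged when a constant is added to every logit, so it cannot tell "all classes low" apart from "one class clearly high". The sketch below uses assumed toy logits (not model outputs) for a head without a no-object class, as in the sigmoid focal loss setup described above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed toy logits: a confident query, and the same pattern shifted down by 7,
# standing in for an un-matched query whose logits are low on every class.
confident = [3.0, -2.0, -2.0]
unmatched = [v - 7.0 for v in confident]

softmax_confident = max(softmax(confident))
softmax_unmatched = max(softmax(unmatched))
sigmoid_confident = max(sigmoid(v) for v in confident)
sigmoid_unmatched = max(sigmoid(v) for v in unmatched)

print(f"softmax: confident={softmax_confident:.3f} unmatched={softmax_unmatched:.3f}")
print(f"sigmoid: confident={sigmoid_confident:.3f} unmatched={sigmoid_unmatched:.3f}")
```

Under softmax the two queries receive the same top score, whereas sigmoid keeps the un-matched one near zero, so only the sigmoid scores rank it below the confident query.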

Back to your question: we decouple the difference between our implementation and DETR's into two parts, the normalization and whether only the highest-score class is used. Comparing 2 and 3 shows that sigmoid normalization rather than softmax gives a +2.2 AP improvement. Comparing 1 and 3 shows that not restricting each query to its highest-score class gives a further +1.1 AP. For the latter, our guess is that some queries have good box regression quality but misclassify the object. The implementation we use allows a query to produce more than one prediction, which reduces this type of classification error.

If you have more thoughts on this question, you are welcome to discuss them.