Default PostProcess behavior is changed compared with DETR

Hi,

Sorry for the late reply. It is a very interesting question. We followed Deformable DETR's implementation at that time. As for your concern, we evaluated our model with different post-process methods:

1. Same as ours: (1) sigmoid normalization, (2) sort on 300*91 predictions: 41.0 AP
2. Same as DETR: (1) softmax normalization, (2) sort on 300 predictions (only the highest-score class is selected for each query): 37.7 AP
3. Something in between: (1) sigmoid normalization, (2) sort on 300 predictions (only the highest-score class is selected for each query): 39.9 AP
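
The three variants above can be sketched roughly as follows. This is a minimal illustration in plain Python (nested lists instead of tensors), not the repository's code. The function names, the `top_k` parameter, and the tiny example logits are assumptions made for illustration. The softmax variant here normalizes over exactly the class logits it is given; how an extra non-object column would be handled is not covered.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_row(row):
    return [sigmoid(v) for v in row]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def flat_top_k(logits, top_k):
    """Variant 1: sigmoid, then rank all (query, class) pairs together."""
    cands = [(sigmoid(v), q, c)
             for q, row in enumerate(logits)
             for c, v in enumerate(row)]
    cands.sort(key=lambda t: t[0], reverse=True)
    return cands[:top_k]

def per_query_max(logits, normalize):
    """Variants 2 and 3: one prediction per query (its highest-score class)."""
    preds = []
    for q, row in enumerate(logits):
        probs = normalize(row)
        c = max(range(len(probs)), key=probs.__getitem__)
        preds.append((probs[c], q, c))
    preds.sort(key=lambda t: t[0], reverse=True)
    return preds

# Hypothetical logits: 2 queries, 3 classes.
logits = [
    [2.0, 1.5, -6.0],    # query 0: confident, with two plausible classes
    [-4.0, -8.0, -8.0],  # query 1: low on every class, as an unmatched query would be
]

variant1 = flat_top_k(logits, top_k=3)         # sigmoid + flattened sort
variant2 = per_query_max(logits, softmax)      # softmax + per-query max
variant3 = per_query_max(logits, sigmoid_row)  # sigmoid + per-query max

for name, preds in [("1", variant1), ("2", variant2), ("3", variant3)]:
    print(name, [(round(s, 3), q, c) for s, q, c in preds])
```

In this toy input, the flattened sort keeps two predictions from query 0, while the softmax variant ranks the low-logit query 1 above query 0.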

From our understanding, the post-process is related to how unmatched predictions (i.e., non-object predictions) are treated. DETR assigns these predictions to an extra class (non-object) and applies softmax normalization over all classes. As a result, the non-object score of an unmatched prediction is much higher than its score for any other class. Conditional DETR (like Deformable DETR), in contrast, uses a binary sigmoid focal loss for classification. Unmatched predictions are assigned an all-zero ground truth (the target score of every class is 0), so they should have very low scores on all classes.
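
To make the two labelling schemes concrete, here is a small sketch of the classification targets each one would build. The helper names and the `matched` dictionary (query index to ground-truth class) are hypothetical; real implementations get the matching from Hungarian assignment and work on tensors.

```python
def softmax_targets(num_queries, num_classes, matched):
    """DETR-style: one class index per query. Unmatched queries get the
    extra non-object index, taken here to be num_classes."""
    no_object = num_classes
    return [matched.get(q, no_object) for q in range(num_queries)]

def sigmoid_targets(num_queries, num_classes, matched):
    """Sigmoid focal loss style: one binary vector per query.
    Unmatched queries keep an all-zero vector."""
    targets = [[0] * num_classes for _ in range(num_queries)]
    for q, cls in matched.items():
        targets[q][cls] = 1
    return targets

# Hypothetical matching: query 0 is matched to class 2; queries 1 and 2 are unmatched.
matched = {0: 2}
detr_style = softmax_targets(3, 4, matched)
focal_style = sigmoid_targets(3, 4, matched)

print(detr_style)   # [2, 4, 4]
print(focal_style)  # [[0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```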

Given the discussion above, if Conditional DETR uses DETR's post-process (including softmax normalization), the scores of unmatched predictions are raised, because softmax rescales each query's class scores to sum to 1 even when every logit is low. This harms the final sorting (resulting in 37.7 AP). DETR, on the other hand, needs softmax normalization in its post-process to suppress the scores of unmatched predictions.
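
A small numeric sketch of this effect, using made-up logits over 91 classes. The specific logit values are assumptions chosen only to show the trend, not measurements from the model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

num_classes = 91
# Hypothetical unmatched query: every logit is strongly negative.
unmatched = [-4.0] + [-8.0] * (num_classes - 1)
# Hypothetical matched query: one clearly positive logit.
matched = [2.0] + [-8.0] * (num_classes - 1)

unmatched_sigmoid = max(sigmoid(v) for v in unmatched)  # about 0.02
unmatched_softmax = max(softmax(unmatched))             # about 0.38
matched_sigmoid = max(sigmoid(v) for v in matched)      # about 0.88
matched_softmax = max(softmax(matched))                 # close to 1

print(f"unmatched: sigmoid={unmatched_sigmoid:.3f} softmax={unmatched_softmax:.3f}")
print(f"matched:   sigmoid={matched_sigmoid:.3f} softmax={matched_softmax:.3f}")
```

With sigmoid, the unmatched query stays far below the matched one. With softmax, its top score is pulled up substantially, so the gap that the final sort relies on shrinks.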

Back to your question: we decouple the difference between our implementation and DETR's into two parts, the normalization and whether only the highest-score class is used. Comparing 2. and 3. shows that using sigmoid normalization rather than softmax gives a +2.2 AP improvement. Comparing 1. and 3. shows that not restricting each query to its highest-score class gives a further +1.1 AP. As for the reason behind the latter, our guess is that some queries have good box regression quality but misclassify the object. The implementation we use allows a query to contribute more than one prediction, which reduces this type of classification error.
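
The guess about misclassified queries can be illustrated with a toy case. The class names, logits, and the per-query cutoff of 2 are all hypothetical; in the actual post-process the ranking runs over all queries and classes at once, not per query.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class_names = ["cat", "dog", "bird"]
# Hypothetical query with a well-localized box: the true class is "dog",
# but "cat" receives a slightly higher logit.
query_logits = [1.2, 1.0, -5.0]
true_class = "dog"

scores = [sigmoid(v) for v in query_logits]

# Highest-score class only: the query yields a single (class, score) pair.
best = max(range(len(scores)), key=scores.__getitem__)
single = [(class_names[best], scores[best])]

# More than one prediction per query: every class remains a candidate,
# ranked by score; here the top 2 are kept.
ranked = sorted(zip(class_names, scores), key=lambda p: p[1], reverse=True)
multi = ranked[:2]

single_hit = any(name == true_class for name, _ in single)
multi_hit = any(name == true_class for name, _ in multi)
print(single_hit, multi_hit)  # False True
```

Under these assumptions, the good box is lost to a wrong label in the single-class case, while the multi-prediction case still emits it with the correct class at a slightly lower score.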

If you have more thoughts on this question, you are welcome to discuss them.

Atten4Vis / ConditionalDETR

Default PostProcess behavior is changed compared with DETR #8