HDETR / H-Deformable-DETR

[CVPR2023] This is an official implementation of paper "DETRs with Hybrid Matching".

Question about the design of the self-attention mask #13

Closed · owen24819 closed this 1 year ago

owen24819 commented 1 year ago

Hi,

Really nice job on the paper. I was excited to read it.

I was wondering if you could explain the attention masks a bit further. FYI, I am referencing your Hybrid branch, which I believe was used for the rest of the paper. If I understood the paper / your code correctly, the attention masks are only used to prevent information leakage between the two groups (one-to-one and one-to-many), which makes sense.

However, I don't understand why you did not also prevent information leakage between the queries within the one-to-many group. I understand that you repeat the ground truth K times so that multiple queries can match the same object. However, I would think that the self-attention performed in the decoder for the one-to-many group would naturally prevent multiple queries from selecting the same object, since the whole point of self-attention there is to remove duplicates. If you added an attention mask within this group as well, I would think that would resolve the issue.

I think I may be fundamentally misunderstanding something, as this clearly worked for you. Any insight would be appreciated. I linked the code I have been looking at below.

Thanks, Owen

https://github.com/HDETR/H-Deformable-DETR/blob/5dea6f4436969fcded8f018b4797753c5b1c0a81/models/deformable_detr.py#L208-L217

https://github.com/HDETR/H-Deformable-DETR/blob/5dea6f4436969fcded8f018b4797753c5b1c0a81/models/deformable_transformer.py#L473-L480
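My reading of those lines is roughly the following (a simplified sketch with assumed config values, not a copy of the repo code):

```python
import torch

# Sketch of the decoder self-attention mask as I understand it: queries are split
# into a one-to-one group followed by a one-to-many group, and only the two
# cross-group blocks are masked (True = attention not allowed), so the groups
# cannot leak information to each other, while attention *within* each group,
# including within the one-to-many group, stays unrestricted.
num_queries_one2one = 300     # assumed value
num_queries_one2many = 1500   # assumed value
num_queries = num_queries_one2one + num_queries_one2many

self_attn_mask = torch.zeros(num_queries, num_queries, dtype=torch.bool)
# one-to-many queries cannot attend to one-to-one queries
self_attn_mask[num_queries_one2one:, :num_queries_one2one] = True
# one-to-one queries cannot attend to one-to-many queries
self_attn_mask[:num_queries_one2one, num_queries_one2one:] = True
```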

PkuRainBow commented 1 year ago

Really good points!

In fact, the interactions between the queries within the one-to-many branch are essential for two-stage Deformable DETR. The main reason is that "the bounding box predictions associated with the 300th ~ 1500th-ranked queries are of much lower quality than those associated with the top 300 queries." For example, we observe that AP can drop by more than 5 points if we replace the top 300 queries with the 300th ~ 600th-ranked queries during both training and evaluation. Therefore, the interactions between these queries essentially boost their localization capability, since multiple queries are matched to the exact same ground-truth box.
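To make the ranking concrete, two-stage Deformable DETR seeds the decoder queries from the top-scoring encoder proposals, roughly as in the sketch below (illustrative names and sizes, not the exact repo code), so the one-to-many branch has to start from the lower-ranked, lower-quality proposals:

```python
import torch

# Illustrative sketch of two-stage proposal selection (not the exact repo code).
# enc_class_logits: per-token classification logits predicted on the encoder output.
batch, num_tokens, num_classes = 2, 10000, 91    # made-up sizes
enc_class_logits = torch.randn(batch, num_tokens, num_classes)

num_queries = 1500                                # e.g. top 300 plus the lower-ranked proposals
scores = enc_class_logits.max(dim=-1).values      # per-proposal "objectness" proxy
topk_indices = torch.topk(scores, num_queries, dim=1).indices

# Roughly: the highest-ranked proposals (top 300) initialize the one-to-one
# queries, while the lower-ranked proposals initialize the one-to-many queries,
# whose initial boxes are therefore of much lower quality.
```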

Besides, we empirically find that preventing information leakage between every query within the one-to-many group has no influence on one-stage Deformable DETR but hurts two-stage Deformable DETR a lot. We also observe that this conclusion holds for other vision tasks, and the key factor is whether a two-stage scheme is used.

A small hint is that the self-attention within the one-to-many branch is mainly for boosting localization, besides performing NMS-like suppression of the duplicates present along the one-to-many branch. In fact, DN-DETR/DINO-DETR really depend on the careful design that uses the default queries along the one-to-one branch as the KEY-VALUE space for all the DN/CDN queries, which is essentially unnecessary in our design.
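For comparison, a much-simplified sketch of a DN-DETR style mask (per-group masking inside the denoising part is omitted): the mask is asymmetric, so the denoising queries can use the ordinary matching queries as extra KEY-VALUE context, while the matching queries are blocked from seeing the denoising queries.

```python
import torch

# Simplified, illustrative contrast with a DN-DETR style mask: denoising queries
# come first, ordinary matching queries follow. The matching queries are blocked
# from attending to the denoising queries (to avoid leaking ground-truth
# information), but the denoising-to-matching block is left open, so the matching
# queries act as key-value context for the DN queries.
num_dn, num_match = 200, 300                  # illustrative sizes
total = num_dn + num_match
dn_attn_mask = torch.zeros(total, total, dtype=torch.bool)
dn_attn_mask[num_dn:, :num_dn] = True         # matching queries cannot see DN queries
# (per-group masking among the DN queries themselves is omitted here)
```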

owen24819 commented 1 year ago

Thank you for the detailed response! I did not realize the top 300 queries / the 300th ~ 1500th queries were coming from the two-stage selection. This makes more sense, but I still can't quite wrap my head around the idea; maybe I just need more time to think about it. Does the model know that the 300th ~ 1500th queries are "bad queries", so it can disregard the NMS behavior, whereas the top 300 queries are "good queries", so it has to apply NMS?

As a follow-up question, I would think you would only be interested in matching and computing the loss for the pred_boxes while ignoring the pred_logits for the one-to-many branch. This way, you still get the localization benefit but don't interfere with the natural NMS of the self-attention.
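Concretely, what I have in mind is something like the following (purely hypothetical; the names and values are made up, not from the repo):

```python
# Hypothetical sketch of the idea: keep box supervision for the one-to-many
# outputs but zero out their classification loss, so the branch still helps
# localization without fighting the duplicate suppression that self-attention
# would otherwise learn.
weight_dict_one2many = {
    "loss_ce": 0.0,    # ignore pred_logits for the one-to-many branch
    "loss_bbox": 5.0,  # keep the L1 box loss
    "loss_giou": 2.0,  # keep the GIoU loss
}
```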

PkuRainBow commented 1 year ago

"Does the model know that the 300-1500 queries are "bad queries" therefore it knows it can disregard the NMS whereas the top 300 queries are "good queries" therefore it knows it has to use NMS?"

We previously tried using independent self-attention for the one-to-one branch and the one-to-many branch and observed slightly better performance. The one-to-many branch relies less on NMS, but it still needs NMS to suppress the additional duplicates beyond the K groups of matched queries.

In fact, we are studying how to decouple self-attention from DETR in our future work.

"....ignoring the pred_logits for the one-to-many branch. "

This might be a good point. We would be happy to see you share improved results based on HDETR.

owen24819 commented 1 year ago

Thanks for the response. It's exciting to hear that you are continuing to work on HDETR.

I will definitely share if I find improved results.