ZhangGongjie / SAM-DETR

[CVPR'2022] SAM-DETR & SAM-DETR++: Official PyTorch Implementation

The question about emb_dim in cross_attention module #7

Bo396543018 opened this issue 1 year ago

Bo396543018 commented 1 year ago

Hi, I found that, compared to other DETR variants, the q and k dimensions used in SAM's cross-attention are higher because of the SPx8 setting. I would like to ask whether it would be fairer to compare with SPx1.

ZhangGongjie commented 1 year ago

Thanks for pointing this out.

In my experience, even if we add an additional Linear layer to reduce the feature dimension, SPx8 still outperforms SPx1. However, that introduces extra components, so we chose the design described in our paper and in the code implementation, which also gives superior performance.

Note that we report #Params and GFLOPs when comparing with other DETR variants in our paper. Higher q and k dimensions bring higher AP, but also higher #Params and GFLOPs.
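
For concreteness, here is a minimal sketch of what the two settings mean for the q/k dimension. This is not the repository's actual code; `d_model = 256` and `num_salient_points = 8` are assumed values, and the extra Linear layer is the hypothetical reduction mentioned above.

```python
import torch
import torch.nn as nn

d_model = 256            # assumed base embedding dimension of the decoder
num_salient_points = 8   # "SPx8": features of 8 salient points are concatenated

# Query/key features built from salient points: concatenation raises the dim.
q_spx8 = torch.randn(100, num_salient_points * d_model)   # 100 queries, dim 2048

# An optional extra Linear layer could project this back to d_model
# (an "SPx1-like" dimension), but it adds components of its own.
reduce = nn.Linear(num_salient_points * d_model, d_model)
q_spx1_like = reduce(q_spx8)                               # dim 256

print(q_spx8.shape, q_spx1_like.shape)
# torch.Size([100, 2048]) torch.Size([100, 256])
```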

Bo396543018 commented 1 year ago

Thank you for your answer. I have another question: in SAM, why are two ROI operations needed to obtain q_content and q_content_point separately?

ZhangGongjie commented 1 year ago

I checked the code. It turns out they are redundant; one ROI operation is enough.
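
As a minimal sketch (using torchvision's `roi_align` with hypothetical tensor shapes and names, not the repository's exact code), the output of a single ROI operation can simply be reused for both `q_content` and `q_content_point`:

```python
import torch
from torchvision.ops import roi_align

# Hypothetical inputs: one feature map and boxes in (batch_idx, x1, y1, x2, y2) format.
features = torch.randn(1, 256, 64, 64)
boxes = torch.tensor([[0.0, 4.0, 4.0, 20.0, 20.0],
                      [0.0, 10.0, 8.0, 30.0, 28.0]])

# A single ROI operation suffices: reuse its output instead of
# calling roi_align a second time on the same boxes.
roi_feat = roi_align(features, boxes, output_size=(7, 7), aligned=True)
q_content = roi_feat        # content query features
q_content_point = roi_feat  # previously produced by a second, redundant ROI op

print(q_content.shape)  # torch.Size([2, 256, 7, 7])
```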