PeizeSun / SparseR-CNN

[CVPR2021, PAMI2023] End-to-End Object Detection with Learnable Proposal
MIT License

Question regarding proposal feature #77

Open abeyang00 opened 3 years ago

abeyang00 commented 3 years ago

I have a question regarding proposal feature.

In the DETR paper, the reshaped feature map (HW x C) is given as input to the transformer encoder to learn correlations between pixels. However, in your paper, you use a C-dimensional vector (named 'prop_feats') instead of the reshaped feature map.

How does this C-dimensional vector learn the correlations among pixels? As I understand it, it does not contain feature information for each pixel position.

I saw your reply in a previous issue where you said 'don't understand dynamic head as Q,K,V'. How should I understand this concept, then?

Thank you in advance!

PeizeSun commented 3 years ago

Hi~ The proposal features contain information about their corresponding objects. Each proposal feature updates itself by interacting with its RoI feature. We don't need feature information for each pixel position.
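
As a rough illustration of the interaction described in this reply, here is a minimal NumPy sketch in which each proposal feature generates its own small set of weights and applies them to its own RoI feature. It is a simplification under stated assumptions, not the repository's code: the weights are random, the channel width is shrunk to C = 16, normalization layers are left out, and the names `dynamic_interaction`, `w_gen`, `w_out`, and `d_dyn` are hypothetical.

```python
import numpy as np


def dynamic_interaction(proposal_feats, roi_feats, w_gen, w_out, d_dyn):
    """Update each proposal feature from its own RoI feature.

    proposal_feats: (N, C)      one vector per proposal
    roi_feats:      (N, S*S, C) one pooled RoI crop per proposal
    w_gen:          (C, 2*C*d_dyn) hypothetical parameter-generating weights
    w_out:          (S*S*C, C)  hypothetical output projection
    """
    n, c = proposal_feats.shape
    # Each proposal feature produces two per-proposal weight matrices.
    params = proposal_feats @ w_gen                   # (N, 2*C*d_dyn)
    p1 = params[:, : c * d_dyn].reshape(n, c, d_dyn)  # (N, C, d_dyn)
    p2 = params[:, c * d_dyn:].reshape(n, d_dyn, c)   # (N, d_dyn, C)
    # Batched matmul: proposal k only ever touches RoI feature k.
    x = np.maximum(roi_feats @ p1, 0.0)               # (N, S*S, d_dyn)
    x = np.maximum(x @ p2, 0.0)                       # (N, S*S, C)
    # Flatten the RoI grid and project back to one vector per proposal.
    return np.maximum(x.reshape(n, -1) @ w_out, 0.0)  # (N, C)


rng = np.random.default_rng(0)
N, C, S, d_dyn = 100, 16, 7, 4  # assumed toy sizes
proposal_feats = rng.standard_normal((N, C))
roi_feats = rng.standard_normal((N, S * S, C))
w_gen = rng.standard_normal((C, 2 * C * d_dyn)) * 0.1
w_out = rng.standard_normal((S * S * C, C)) * 0.1

out = dynamic_interaction(proposal_feats, roi_feats, w_gen, w_out, d_dyn)
print(out.shape)
```

In this sketch the updated feature for a proposal depends only on that proposal's vector and its own RoI crop, which is why no per-pixel feature map over the whole image appears as an input to this step.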

abeyang00 commented 3 years ago

So the RoI feature can be regarded as the Query and the proposal features as the Key?

PeizeSun commented 3 years ago

I guess the Query is the proposal features [100 x C] and the Key is the RoI features [100 x (7 x 7 x C)]. Think about DETR: the Query is the object queries [100 x C] and the Key is the reshaped image feature map repeated 100 times [100 x (HW x C)], where each (HW x C) copy is the same.

HYUNJS commented 3 years ago

@PeizeSun Don't Q and K need to have the same hidden dimension for the matrix multiplication? For example, in DETR Q is [100 x C] and K is [HW x C] rather than [100 x (HWC)].
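
One possible way to reconcile the two shape notations in this thread is to read [100 x (HW x C)] as a stack of 100 key sets, each of shape (HW x C), rather than as a flattened [100 x HWC] matrix. The NumPy sketch below checks only the shapes under that reading, with random data and assumed toy sizes. It is not how the repository's dynamic head is computed, which the author elsewhere says should not be understood as Q, K, V.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, HW, S = 100, 16, 30, 7  # assumed toy sizes

queries = rng.standard_normal((N, C))  # [100 x C]

# DETR-style: every query shares one key set of shape [HW x C].
keys_shared = rng.standard_normal((HW, C))
scores_detr = queries @ keys_shared.T  # (N, HW)

# The same scores, written with the keys conceptually repeated per query:
# [100 x (HW x C)], where every (HW x C) slice is identical.
keys_tiled = np.broadcast_to(keys_shared, (N, HW, C))
scores_tiled = np.einsum("nc,nkc->nk", queries, keys_tiled)  # (N, HW)

# Sparse R-CNN-style analogy: each proposal has its own key set, namely
# its own RoI feature of shape [(7 x 7) x C].
roi_feats = rng.standard_normal((N, S * S, C))
scores_roi = np.einsum("nc,nkc->nk", queries, roi_feats)  # (N, 49)

print(scores_detr.shape, scores_tiled.shape, scores_roi.shape)
print(np.allclose(scores_detr, scores_tiled))
```

In both cases the contraction runs over the shared hidden dimension C, so Q and K do have matching hidden sizes. The difference in this reading is only whether the key set is shared across queries or specific to each one.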