Adonis-galaxy / DepthCLIP

Official implementation of "Can Language Understand Depth?"

Question on feature alignment #5

Open weiyaowang opened 7 months ago

weiyaowang commented 7 months ago

Thanks for this interesting work! I have a question regarding the feature space alignment. The paper uses the ResNet50 features taken before the attention pooling layer, but those features live in a different embedding space than the features after attention pooling. In particular, CLIP's attention pooling contains a linear layer projecting the features from 2048 to 1024 dimensions. It seems very strange that the text tokens are aligned with the features prior to pooling. Some experiments revealed that the pooled features are very different from the unpooled features, and simple interpolation won't align them. I wonder if the authors have some thoughts on why similarities between visual and text features can be directly computed this way on the visual features before pooling?
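
To make the mismatch concrete, here is a small sketch of what I mean (my own snippet, not code from this repo), assuming the OpenAI CLIP package and the RN50 checkpoint. The prompt string is just an illustrative depth-style sentence:

```python
import torch
import clip

# Load CLIP RN50 on CPU; weights are kept in float32 there.
model, _ = clip.load("RN50", device="cpu")
model.eval()

# Grab the 2048-channel feature map that is fed INTO attnpool via a forward hook.
pre_pool = {}
model.visual.attnpool.register_forward_hook(
    lambda module, inputs, output: pre_pool.update(feat=inputs[0]))

with torch.no_grad():
    dummy_image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
    _ = model.encode_image(dummy_image)
    text_feat = model.encode_text(clip.tokenize(["This object is close."]))

print(pre_pool["feat"].shape)  # torch.Size([1, 2048, 7, 7]) -- before attention pooling
print(text_feat.shape)         # torch.Size([1, 1024])       -- after the c_proj projection
```

So the pre-pool visual features and the text embeddings don't even share a dimensionality until the attnpool's projection is applied.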

weiyaowang commented 7 months ago

It seems like you would still need to call the attnpool, but use the original patches themselves as queries instead of only the pooled mean token?
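
Something along these lines, maybe? This is only a sketch of that idea (similar in spirit to what DenseCLIP does), reusing the attnpool weights from the OpenAI CLIP RN50 checkpoint; `dense_attnpool` and `feat_2048` are names I made up, and `feat_2048` is the 2048-channel map fed into attnpool (e.g. grabbed with a forward hook as above). It assumes a 224x224 input so the 7x7 grid matches the positional embedding:

```python
import torch
import torch.nn.functional as F
import clip

model, _ = clip.load("RN50", device="cpu")
attnpool = model.visual.attnpool  # CLIP's AttentionPool2d

def dense_attnpool(feat_2048):
    """Run CLIP's attention pool, but keep every spatial token as a query,
    so each patch gets its own 1024-d output in the joint embedding space."""
    n, c, h, w = feat_2048.shape                              # e.g. (N, 2048, 7, 7)
    x = feat_2048.flatten(start_dim=2).permute(2, 0, 1)       # (HW, N, 2048)
    x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)    # prepend the mean token
    x = x + attnpool.positional_embedding[:, None, :].to(x.dtype)
    out, _ = F.multi_head_attention_forward(
        query=x, key=x, value=x,          # all tokens as queries, not just the mean
        embed_dim_to_check=c,
        num_heads=attnpool.num_heads,
        q_proj_weight=attnpool.q_proj.weight,
        k_proj_weight=attnpool.k_proj.weight,
        v_proj_weight=attnpool.v_proj.weight,
        in_proj_weight=None,
        in_proj_bias=torch.cat([attnpool.q_proj.bias,
                                attnpool.k_proj.bias,
                                attnpool.v_proj.bias]),
        bias_k=None, bias_v=None,
        add_zero_attn=False, dropout_p=0.0,
        out_proj_weight=attnpool.c_proj.weight,  # 2048 -> 1024, same space as text
        out_proj_bias=attnpool.c_proj.bias,
        use_separate_proj_weight=True,
        training=False, need_weights=False)
    return out[1:].permute(1, 2, 0).reshape(n, -1, h, w)      # (N, 1024, H, W)
```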

Adonis-galaxy commented 7 months ago

Hi. Thank you for your insightful question!

Indeed, the pooled features can sometimes differ a lot from the unpooled features. However, in the setting of CLIP, the similarity response still seems to be preserved: other CLIP-based dense prediction methods (e.g., DenseCLIP, PointCLIP V2) employ a similar similarity calculation and function pretty well. One intuition is that CLIP is trained with an image classification pretext task, which means that patches containing the salient object need to preserve their features after pooling.
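
For reference, here is a rough sketch of that kind of similarity read-out. It is simplified, not the exact code of this repo: the prompt template and depth-bin values below are placeholders, and it assumes per-patch 1024-d features already in CLIP's joint space (for instance from a dense attention pooling like the one you sketched above):

```python
import torch
import clip

model, _ = clip.load("RN50", device="cpu")

# Placeholder depth bins (meters) and a made-up prompt template.
bin_values = torch.tensor([1.0, 2.0, 3.5, 5.0, 7.5, 10.0])
prompts = [f"This object is at {d} meters." for d in bin_values.tolist()]
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(prompts))     # (K, 1024)

def depth_from_dense_features(dense_feat):
    """dense_feat: (N, 1024, H, W) per-patch features in CLIP's joint space."""
    n, c, h, w = dense_feat.shape
    v = dense_feat.flatten(2).permute(0, 2, 1)                 # (N, HW, 1024)
    v = v / v.norm(dim=-1, keepdim=True)                       # L2-normalize patches
    t = text_feat / text_feat.norm(dim=-1, keepdim=True)       # L2-normalize prompts
    logits = 100.0 * v @ t.t()                                  # cosine similarity x temperature
    weights = logits.softmax(dim=-1)                            # (N, HW, K) soft bin assignment
    depth = weights @ bin_values                                 # expected depth per patch
    return depth.reshape(n, h, w)                                # (N, H, W) depth map
```

The key point is that the read-out only needs the relative similarity pattern between patches and prompts to be sensible, which is what we observe empirically.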

Let me know if you have further questions~