CVMI-Lab / SlotCon

(NeurIPS 2022) Self-Supervised Visual Representation Learning with Semantic Grouping

https://wen-xin.info/slotcon/

Apache License 2.0

About the prototypes #1

Closed DevinCheung closed 11 months ago

DevinCheung commented 2 years ago

Hi Xin Wen,

Thanks for your great work! I have two questions about SlotCon: (1) I notice that the prototypes are initialized with nn.Embedding. How do you ensure that these trainable prototypes are optimized into meaningful semantic groups via backpropagation? The loss functions do not seem to enforce this explicitly, so I am a bit confused about how the prototypes are optimized. (2) Have you tested how much the resize operation in the data augmentation matters? That is, if you only crop (along with the other augmentations) and skip the resize, does the performance drop heavily?

Thanks for your reply!

xwen99 commented 2 years ago

Hi @DevinCheung, thanks for your attention to our work.

Regarding your first question, please see Sec. 5 (Discussion on the emergence of objectness) of our paper. Put simply, this technique follows deep clustering, and the emergence of objectness can be viewed as a joint result of geometric covariance, photometric invariance, and compositionality priors.
Regarding the resize operation, dropping it alone is impractical, since all images within a batch must have the same shape. However, we did try removing both the cropping and resizing operations (see the last part of Sec. C in the appendix); the model failed to learn semantics, leading to a significant performance drop.
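
For readers following the first point, here is a minimal sketch of the deep-clustering idea: a pixel feature is softly assigned to a set of prototypes, and the assignments of the same location in two views are compared with a cross-entropy. This is not the repository's code. It assumes a softmax over cosine similarities, and the names `soft_assign`, `cross_entropy`, the temperature, and the 2-D toy vectors are all hypothetical. Only the forward computation is shown, with no gradients and no collapse-avoidance measures.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def soft_assign(feature, prototypes, temperature=0.1):
    """Softmax over cosine similarities between one feature and each prototype."""
    f = l2_normalize(feature)
    logits = [
        sum(a * b for a, b in zip(f, l2_normalize(p))) / temperature
        for p in prototypes
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between two assignment distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


# Hypothetical toy data: two prototypes and the same pixel seen in two views.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
view1_pixel = [0.9, 0.1]
view2_pixel = [0.8, 0.3]

q1 = soft_assign(view1_pixel, prototypes)
q2 = soft_assign(view2_pixel, prototypes)
loss = cross_entropy(q1, q2)  # small when both views agree on the assignment
```

In a trainable version the prototypes sit on the path of this loss, so they receive gradients even though no term names them as "semantic groups" explicitly: they move toward whatever makes the assignments consistent across views.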

DevinCheung commented 2 years ago

Hi @xwen99, thanks for your quick reply! I may not have put my second question clearly.

To give an example: I take two crops of one image. Both crops keep the same scale as the raw image (i.e., no resize operation), and they overlap, which is essential for computing L_{Group}. I then apply Gaussian blur, color jitter, etc. to the two crops to obtain the two views that are fed into the network. In this setup, no RoI_Align operation is needed.

In short, the resize augmentation is removed and the others are kept. Will this cause a significant performance drop?
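
To make the setting concrete, here is a rough sketch of how the overlap of two unresized crops could be located with plain index arithmetic. The function name `crop_overlap` and the `(left, top, width, height)` box convention are assumptions made for illustration only. Flips and the stride of the feature map are ignored.

```python
def crop_overlap(box_a, box_b):
    """Find the shared region of two crops taken from the same raw image.

    Boxes are (left, top, width, height) in raw-image pixels.
    Returns ((rows_a, cols_a), (rows_b, cols_b)) as slices local to each
    crop, or None if the crops do not overlap.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Intersection rectangle in raw-image coordinates.
    left, top = max(ax, bx), max(ay, by)
    right, bottom = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    if right <= left or bottom <= top:
        return None

    # Shift the intersection into each crop's own coordinate frame.
    in_a = (slice(top - ay, bottom - ay), slice(left - ax, right - ax))
    in_b = (slice(top - by, bottom - by), slice(left - bx, right - bx))
    return in_a, in_b


# Two 224x224 crops, the second shifted by (100, 50) pixels.
overlap = crop_overlap((0, 0, 224, 224), (100, 50, 224, 224))
```

Because both crops keep the raw-image scale, the overlapping pixels correspond one to one, so the two slices can index the two views directly without interpolation.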

Thanks!

xwen99 commented 2 years ago

Hi @DevinCheung, the last part of Sec. C in the appendix covers precisely this setting; please have a look. In short, yes, the performance drops significantly.