CVMI-Lab / SimGCD

(ICCV 2023) Parametric Classification for Generalized Category Discovery: A Baseline Study
https://arxiv.org/abs/2211.11727

Question about the embedding space of contrastive learning and clustering? #4

Closed · mashijie1028 closed this 1 year ago

mashijie1028 commented 1 year ago

Hi, thanks for your excellent work. I was wondering why prototype classification and representation learning are implemented in two different embedding spaces, i.e., bottleneck_dim $\neq$ in_dim (if I didn't misunderstand the code). Would it be helpful if representation learning and clustering were performed in the same embedding space?

Also, in your case the dimension for contrastive (representation) learning is 256 (bottleneck_dim in your code), whereas it is 65536 (out_dim) in the GCD code, which is a huge difference. How does the projection dimension influence the results?
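
For reference, here is a minimal sketch of the DINO-style projection head I am referring to (simplified; the parameter names follow the DINO/GCD code, but the exact SimGCD implementation may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DINOHeadSketch(nn.Module):
    """Simplified sketch of a DINO-style projection head (not the exact SimGCD/GCD code).

    in_dim:         backbone feature dim (e.g. 768 for ViT-B/16)
    bottleneck_dim: dim used for contrastive/representation learning (256)
    out_dim:        dim of the weight-normalized last layer (65536 in DINO/GCD)
    """
    def __init__(self, in_dim=768, hidden_dim=2048, bottleneck_dim=256, out_dim=65536):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        # Weight-normalized last layer as in DINO (kept in the GCD code).
        self.last_layer = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False))
        self.last_layer.weight_g.data.fill_(1)

    def forward(self, x):
        z = F.normalize(self.mlp(x), dim=-1, p=2)   # 256-d feature (bottleneck_dim)
        out = self.last_layer(z)                    # 65536-d output (out_dim)
        return z, out
```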

Thanks!

xwen99 commented 1 year ago

Hi @msj17thu,

Thanks for your attention to our work. Regarding your question on the selection of the representation space, this is actually an important topic discussed in our paper; please refer to Sec. 3.2 (arXiv v2) or Sec. 3.3 (arXiv v1).

Regarding the dimension of the last layer of the projector (the MLP part of the DINOHead), 256 is a common choice in the self-supervised learning community, and it also holds for DINO, the paper that GCD follows. The last_layer with 65536 components is a special design for prototypical contrastive learning, and it should be dropped if one only intends to perform vanilla contrastive learning between images, as the text of the GCD paper describes; we believe keeping the last_layer is unintended behaviour on the part of the GCD authors. As for performance, the difference is quite negligible, though GCD's implementation might produce small gains.
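
To make the last point concrete, here is a rough sketch (not the exact loss code in this repository) of vanilla contrastive learning between two augmented views: the objective operates directly on the L2-normalized bottleneck_dim = 256 features, so the 65536-dimensional last_layer is simply never used by it.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """SimCLR-style InfoNCE over two batches of projected features (illustrative sketch).

    z1, z2: (B, 256) bottleneck outputs of the projector, one per augmented view.
    The loss only ever sees these 256-d embeddings; no 65536-d layer is involved.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    z = torch.cat([z1, z2], dim=0)            # (2B, 256)
    sim = z @ z.t() / temperature             # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))         # mask self-similarity
    B = z1.size(0)
    # positives: sample i in view 1 matches sample i + B in view 2, and vice versa
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

With the head sketch above, something like `z1, _ = head(feats_view1)` and `z2, _ = head(feats_view2)` followed by `info_nce(z1, z2)` is the whole image-to-image contrastive objective; the last_layer output is only relevant for DINO-style prototypical/self-distillation losses.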

mashijie1028 commented 1 year ago

Okay, I figured it out! Thank you for your solid analysis and patient response. 😄 👍