mashijie1028 closed this issue 1 year ago
Hi @msj17thu,
Thanks for your attention to our work. Regarding your question on the selection of the representation space: this is actually an important topic discussed in our paper; please refer to Sec. 3.2 (arXiv v2) or Sec. 3.3 (arXiv v1).
Regarding the dimension of the last layer of the projector (the MLP part of `DINOHead`), 256 is a common choice in the self-supervised learning community, and it also holds for DINO, the method that GCD follows. The `last_layer` with 65536 output components is a special design for prototypical contrastive learning, and it should be dropped if one only intends to perform vanilla contrastive learning between images, as the text of the GCD paper states; we believe keeping the `last_layer` was an unintended behaviour by the GCD authors. As for performance, the difference is quite negligible, though GCD's implementation might yield small gains.
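To make the two spaces concrete, here is a minimal PyTorch sketch of a DINO-style projection head. The names (`DINOHead`, `bottleneck_dim`, `out_dim`) mirror the DINO/GCD code, but the hidden width (2048) and the assumed ViT-B backbone dimension (`in_dim=768`) are illustrative defaults, not the exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DINOHead(nn.Module):
    """Illustrative DINO-style projection head (assumed dims, not the exact code)."""
    def __init__(self, in_dim=768, hidden_dim=2048, bottleneck_dim=256, out_dim=65536):
        super().__init__()
        # MLP: backbone features -> 256-d space used for contrastive representation learning
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        # Weight-normalized last layer: 256 -> 65536 prototype logits.
        # Needed only for DINO's prototypical loss; drop it for vanilla
        # image-to-image contrastive learning, as discussed above.
        self.last_layer = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False))

    def forward(self, x):
        z = F.normalize(self.mlp(x), dim=-1)  # (B, 256) L2-normalized embedding
        logits = self.last_layer(z)           # (B, 65536) prototype scores
        return z, logits

head = DINOHead()
z, logits = head(torch.randn(4, 768))
```

Here `z` lives in the 256-d space where the contrastive loss operates, while `logits` lives in the 65536-d prototype space; keeping or dropping `self.last_layer` is exactly the design choice discussed above.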
Okay, I figured it out! Thank you for your solid analysis and patient response. 😄 👍
Hi, thanks for your excellent work. I was wondering why prototype classification and representation learning are implemented in two different embedding spaces, i.e., `bottleneck_dim` $\neq$ `in_dim` (if I didn't misunderstand the code). Would it be helpful if representation learning and clustering were done in the same embedding space?

Also, in your case the dimension of contrastive (representation) learning is 256 (`bottleneck_dim` in your code), while it is 65536 (`out_dim` in the GCD code) in the GCD paper. That is a huge difference; how does the projection dimension influence the results?

Thanks!