facebookresearch / dino

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
Apache License 2.0

Role of centering in preventing collapse #101

Open pmgautam opened 3 years ago

pmgautam commented 3 years ago

I am not able to interpret the statement that centering prevents one dimension from dominating but encourages collapse to the uniform distribution. Since we are subtracting a number and then taking the softmax, the distribution remains the same, which is similar to stabilizing the softmax function. Can someone help me? I feel I am missing something here. Thanks in advance!
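
To show what I mean, here is a minimal sketch (the tensor names are just illustrative, not from the DINO code) of the scalar shift-invariance of softmax:

```python
import torch

logits = torch.randn(5)          # a single logit vector
c = logits.max()                 # any scalar shift, e.g. the max (the stabilization trick)

p_original = torch.softmax(logits, dim=0)
p_shifted = torch.softmax(logits - c, dim=0)

# Subtracting the same scalar from every logit leaves the distribution unchanged.
print(torch.allclose(p_original, p_shifted))  # True
```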

woctezuma commented 3 years ago

For the information of others, this is the paragraph you are referring to:

Article

mathildecaron31 commented 2 years ago

Hi @pmgautam

You are right that subtracting a number from the logits before applying softmax does not change the distribution (it is the classical trick to stabilize softmax): exp(t_i - c) / sum_j exp(t_j - c) = exp(t_i) / sum_j exp(t_j)

However, here the center is a vector and not a scalar, so the operation we are doing is: exp(t_i - c_i) / sum_j exp(t_j - c_j). Because each dimension is shifted by a different amount, the relative gaps between the logits change, and so does the softmax output.
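
To illustrate, here is a minimal sketch (not the exact DINO code; the names are illustrative) showing that a per-dimension center does change the softmax output, unlike a scalar shift:

```python
import torch

logits = torch.randn(5)
center = torch.randn(5)          # a per-dimension center (a vector, not a scalar)

p_original = torch.softmax(logits, dim=0)
p_centered = torch.softmax(logits - center, dim=0)

# Each dimension is shifted by a different amount, so the relative gaps
# between logits change and the resulting distribution changes as well.
print(torch.allclose(p_original, p_centered))  # False (in general)
```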

Hope that helps to clarify the centering operation.

anniepank commented 2 years ago

Hi @mathildecaron31,

But even if one has a multidimensional distribution and changes each dimension by subtracting some number and then taking the softmax, the overall distribution will still remain the same. So I still don't see how it would become a uniform distribution.

ratthachat commented 9 months ago

Is this an empirical concept rather than a theoretical one, and is that why there is no clear logical explanation here?

Edit: the closest explanation I could find is that BYOL needs batch normalization in place of negative contrasting:

https://imbue.com/research/2020-08-24-understanding-self-supervised-contrastive-learning/

And the centering technique used in this paper is a direct simplification of what is essential in batch norm.
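
For reference, the centering operation itself is simple: the teacher outputs are shifted by a per-dimension center that is an exponential moving average of the batch mean. A minimal single-process sketch (variable names are illustrative; see DINOLoss.update_center in this repo for the exact, distributed implementation):

```python
import torch

# Sketch of DINO-style centering (simplified, single process).
out_dim = 65536                  # projection head output dimension
momentum = 0.9                   # EMA momentum for the center
center = torch.zeros(1, out_dim)

@torch.no_grad()
def update_center(teacher_output, center, momentum):
    # Per-dimension mean of the teacher outputs over the batch.
    batch_center = teacher_output.mean(dim=0, keepdim=True)
    # Exponential moving average update of the center.
    return center * momentum + batch_center * (1 - momentum)

# The teacher probabilities are then softmax((teacher_output - center) / teacher_temp),
# i.e. exactly the per-dimension subtraction discussed above.
```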