Open pmgautam opened 3 years ago
For the information of others, this is the paragraph being referred to:
Hi @pmgautam
You are right that subtracting a scalar from the logits before applying softmax does not change the distribution (this is the classical trick to stabilize softmax): exp(t_i - c) / sum_j exp(t_j - c) = exp(t_i) / sum_j exp(t_j)
However, here the center is a vector, not a scalar, so the operation we are doing is: exp(t_i - c_i) / sum_j exp(t_j - c_j)
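A quick numerical check makes the distinction concrete (a sketch in NumPy, not code from the thread): subtracting the same scalar from every logit leaves the softmax output unchanged, while subtracting a per-dimension center vector c_i changes it.

```python
import numpy as np

def softmax(t):
    # standard stabilization: shift by the max before exponentiating
    e = np.exp(t - t.max())
    return e / e.sum()

t = np.array([2.0, 1.0, 0.5])

# Subtracting the same scalar from every logit leaves softmax unchanged
c_scalar = 10.0
assert np.allclose(softmax(t), softmax(t - c_scalar))

# Subtracting a per-dimension center changes the distribution:
# here dimension 0 has a large center (e.g. a large running mean),
# so its probability mass is suppressed after centering
c_vec = np.array([2.0, 0.0, 0.0])
print(softmax(t))          # dimension 0 dominates
print(softmax(t - c_vec))  # dimension 0 is suppressed
```

This is exactly the point being made above: the invariance argument only applies when c is the same scalar across all dimensions.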
Hope that helps to clarify the centering operation.
Hi @mathildecaron31,
but even if one has a multidimensional distribution and changes each dimension by subtracting some number before doing softmax, the overall distribution will still remain the same. So I still don't see how it would become a uniform distribution.
Is this an empirical concept rather than a theoretical one, and is that why there is no clear logical explanation here?
Edit: the closest explanation I could find is that BYOL needs batch normalization in place of negative contrasting:
https://imbue.com/research/2020-08-24-understanding-self-supervised-contrastive-learning/
And the centering technique used in this paper is a direct simplification of what is essential in batchnorm.
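To illustrate the connection, here is a minimal sketch of an EMA-based centering update in the spirit of DINO's teacher centering (the function name `update_center` and the `momentum` value are my own; the paper updates the center as an exponential moving average of the batch mean of teacher outputs):

```python
import numpy as np

def update_center(center, teacher_logits, momentum=0.9):
    # EMA of the batch mean of the teacher logits; the center is then
    # subtracted from the teacher logits before the softmax
    batch_mean = teacher_logits.mean(axis=0)
    return momentum * center + (1 - momentum) * batch_mean

# toy batch where dimension 0 dominates every sample
logits = np.array([[5.0, 0.1, 0.2],
                   [4.8, 0.0, 0.3]])

center = np.zeros(3)
for _ in range(100):
    center = update_center(center, logits)

# the center converges to the batch mean, so the dominant dimension
# is pulled back toward the others before the softmax is applied
centered = logits - center
```

Like the per-channel mean subtraction in batchnorm, this removes the (estimated) mean of each output dimension, which is why a single dimension cannot stay permanently dominant after centering.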
I am not able to interpret the statement that centering prevents one dimension from dominating but encourages collapse to the uniform distribution. Since we are subtracting a number and doing softmax, the distribution remains the same, which is similar to stabilizing the softmax function. Can someone help me, as I feel I am missing something here? Thanks in advance!