eezywu opened this issue 3 years ago
@eezywu, do you know why batch normalization would prevent the model from learning good representations? I don't really understand this point. Also, does shuffleBN only work in a multi-GPU setup? If I have only one GPU, does shuffleBN still work?
I am also trying to understand this point. Was anyone able to find an answer?
It comes down to the batch statistics, as stated in the paper:
This ensures the batch statistics used to compute a query and its positive key come from two different subsets.
If the same mini-batch is fed to both the q and k encoders, their outputs can end up very similar, because 1) the weights of k are updated from q (momentum update), so the two encoders stay close to each other; 2) the augmentations may be too weak to make the two views sufficiently different; etc.
Thus the mini-batch is randomly shuffled before the k encoder, so each GPU sees a different subset of samples; the per-GPU batch statistics (and hence the outputs) then differ between q and k. Finally, the output of k is un-shuffled so that each key is matched with the query from the same image as a positive pair.
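Roughly, the bookkeeping looks like this. This is just a minimal single-process sketch with made-up helper names (`shuffle_keys` / `unshuffle_keys`), not the repo's actual code; the real multi-GPU version also gathers the batch across GPUs and broadcasts the shuffle indices so that the reshuffled batch is re-split across devices:

```python
import torch

def shuffle_keys(im_k):
    # Randomly permute the key mini-batch before it enters encoder_k.
    idx_shuffle = torch.randperm(im_k.size(0), device=im_k.device)
    # The argsort of a permutation is its inverse; it undoes the shuffle later.
    idx_unshuffle = torch.argsort(idx_shuffle)
    return im_k[idx_shuffle], idx_unshuffle

def unshuffle_keys(k, idx_unshuffle):
    # Restore the original order so k[i] is the positive key of q[i].
    return k[idx_unshuffle]

# Inside a MoCo-style forward pass (encoder_q / encoder_k assumed to exist):
# q = encoder_q(im_q)                              # queries, original order
# im_k_shuf, idx_unshuffle = shuffle_keys(im_k)
# with torch.no_grad():
#     k = encoder_k(im_k_shuf)                     # BN now sees reshuffled sub-batches
# k = unshuffle_keys(k, idx_unshuffle)             # re-align positives with q
```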
If you only have one GPU, one option is to vary which samples are grouped together when computing the statistics for a query and its positive key; see the sketch below for one way to emulate per-GPU statistics on a single GPU.
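A trick used in some single-GPU reimplementations is to compute BN over virtual sub-batches instead of the full batch. Below is a rough sketch of such a layer; the class name and the details (e.g. `num_splits`) are illustrative, not from this repo:

```python
import torch.nn as nn
import torch.nn.functional as F

class SplitBatchNorm2d(nn.BatchNorm2d):
    """BN whose statistics are computed over `num_splits` virtual sub-batches,
    emulating per-GPU BN statistics on a single GPU (illustrative sketch)."""

    def __init__(self, num_features, num_splits=2, **kwargs):
        super().__init__(num_features, **kwargs)
        self.num_splits = num_splits

    def forward(self, x):
        n, c, h, w = x.shape
        if self.training:
            assert n % self.num_splits == 0, "batch size must be divisible by num_splits"
            # Fold the splits into the channel dimension so each virtual
            # sub-batch gets its own mean/variance.
            mean_split = self.running_mean.repeat(self.num_splits)
            var_split = self.running_var.repeat(self.num_splits)
            out = F.batch_norm(
                x.view(n // self.num_splits, c * self.num_splits, h, w),
                mean_split, var_split,
                self.weight.repeat(self.num_splits), self.bias.repeat(self.num_splits),
                True, self.momentum, self.eps).view(n, c, h, w)
            # Merge the per-split running stats back into a single set.
            self.running_mean.copy_(mean_split.view(self.num_splits, c).mean(dim=0))
            self.running_var.copy_(var_split.view(self.num_splits, c).mean(dim=0))
            return out
        # Evaluation: ordinary BN with the merged running statistics.
        return super().forward(x)
```

Together with shuffling the key batch order as above, the statistics used for a query and for its positive key then come from different subsets of samples, which is the effect ShuffleBN achieves across GPUs.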
Can we replace ShuffleBN with GN to avoid the cheating?
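For example, something like the following, which builds the encoders with GroupNorm instead of BN via torchvision's `norm_layer` hook (the 32 groups and 128-d output here are just placeholder choices, not anything from the repo). Since GN normalizes within each sample, there are no cross-sample batch statistics to leak information, so no shuffling would be needed; whether it matches BN + ShuffleBN in accuracy is a separate question.

```python
from functools import partial

import torch.nn as nn
from torchvision.models import resnet50

# Placeholder choices: 32 groups for GroupNorm, 128-d projection output.
encoder_q = resnet50(num_classes=128, norm_layer=partial(nn.GroupNorm, 32))
encoder_k = resnet50(num_classes=128, norm_layer=partial(nn.GroupNorm, 32))
```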