facebookresearch / moco

PyTorch implementation of MoCo: https://arxiv.org/abs/1911.05722
MIT License

Can group normalization (GN) be used to avoid cheating? #80

Open eezywu opened 3 years ago

eezywu commented 3 years ago

Can we replace ShuffleBN with GN to avoid the cheating issue?
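
For reference, a minimal sketch of what such a swap could look like with torchvision's `norm_layer` hook (an illustration, not something this repo provides; the 32-group choice is arbitrary). GroupNorm computes statistics per sample, so nothing is shared across the batch:

```python
# Sketch: build a ResNet-50 encoder with GroupNorm instead of BatchNorm.
# GroupNorm normalizes each sample independently, so there are no
# cross-sample batch statistics that could leak information.
import torch.nn as nn
import torchvision.models as models

def resnet50_gn(num_classes=128, num_groups=32):
    norm = lambda channels: nn.GroupNorm(num_groups, channels)
    return models.resnet50(num_classes=num_classes, norm_layer=norm)
```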

NeuZhangQiang commented 3 years ago

@eezywu, do you know why batch normalization would prevent the model from learning good representations? I don't really understand this point. Also, does ShuffleBN only work in a multi-GPU setup? If I have only one GPU, does ShuffleBN still work?

Mushtaqml commented 3 years ago

I am also trying to understand this point. Was anyone able to find a solution?

shuuchen commented 3 years ago

It comes down to the batch statistics, as the paper explains:

This ensures the batch statistics used to compute a query and its positive key come from two different subsets.

If the same mini-batch is fed to the q and k encoders, their outputs can end up very similar, because 1) the weights of k are a momentum-updated copy of q's weights, so the two encoders stay close to each other; 2) the augmentation may be too weak to distinguish the two views; etc.

Thus the mini-batch is randomly shuffled across GPUs before the key encoder's forward pass. This changes which samples share batch statistics, so the per-GPU statistics (and hence the outputs) differ between the query and key passes. Finally, the output of k is un-shuffled so that each key is matched to its query as a positive pair.
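
A self-contained toy illustration of the idea (this is not the repo's multi-GPU implementation; here the per-GPU BN statistics are emulated by chunking the batch):

```python
# Toy illustration of ShuffleBN: per-GPU BN statistics are emulated by
# chunking the batch; shuffling changes which samples share statistics.
import torch
import torch.nn as nn

torch.manual_seed(0)
num_gpus, per_gpu, dim = 4, 8, 16
x = torch.randn(num_gpus * per_gpu, dim)
bn = nn.BatchNorm1d(dim)   # stand-in for the BN layers inside the encoder

def sharded_bn(batch):
    # each chunk plays the role of one GPU's sub-batch: statistics are
    # computed within the chunk only, as they would be per GPU
    return torch.cat([bn(shard) for shard in batch.chunk(num_gpus)])

q = sharded_bn(x)                              # query pass: original order

idx_shuffle = torch.randperm(x.size(0))        # shuffle before the key pass
idx_unshuffle = torch.argsort(idx_shuffle)     # inverse permutation
k = sharded_bn(x[idx_shuffle])[idx_unshuffle]  # key pass, then un-shuffle

# Same inputs, but each key was normalized with a different subset than its
# query, so the two outputs differ even though the pairs are aligned again.
print((q - k).abs().max())
```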

If you only have one GPU, one option is to build positive pairs from different sample combinations, so that the query and key batches are not composed identically.
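
Another single-GPU workaround seen in some MoCo reproductions (an assumption here, not something this thread confirms) is to emulate the per-GPU statistics directly with a "split" BatchNorm that computes statistics over virtual sub-batches:

```python
# Sketch of a split BatchNorm for single-GPU training: statistics are
# computed over `num_splits` virtual sub-batches instead of the full batch,
# emulating the per-GPU statistics that ShuffleBN relies on.
# Running-statistics bookkeeping for eval mode is omitted for brevity,
# and the batch size is assumed to be divisible by `num_splits`.
import torch.nn as nn
import torch.nn.functional as F

class SplitBatchNorm2d(nn.BatchNorm2d):
    def __init__(self, num_features, num_splits=8, **kwargs):
        super().__init__(num_features, **kwargs)
        self.num_splits = num_splits

    def forward(self, x):
        if not self.training:
            return super().forward(x)
        n, c, h, w = x.shape
        # fold the virtual sub-batches into the channel dimension so that
        # batch_norm computes separate statistics for each sub-batch
        x = x.view(n // self.num_splits, c * self.num_splits, h, w)
        out = F.batch_norm(
            x, None, None,
            self.weight.repeat(self.num_splits),
            self.bias.repeat(self.num_splits),
            True, self.momentum, self.eps)
        return out.view(n, c, h, w)
```

One would then substitute this module for the BN layers in the encoders (e.g., via the backbone's `norm_layer` argument), optionally combined with the shuffle/un-shuffle step sketched above.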