facebookresearch / moco

PyTorch implementation of MoCo: https://arxiv.org/abs/1911.05722
MIT License

what information is leaked due to intra-batch communication? #137

Open hackeryhw opened 1 year ago

hackeryhw commented 1 year ago

Hi,

I was unable to understand what type of information is leaked due to intra-batch communication. Could someone help me understand this point, or point me to a source?

The authors mentioned in the paper that:

Shuffling BN. Our encoders fq and fk both have Batch Normalization (BN) [37] as in the standard ResNet [33]. In experiments, we found that using BN prevents the model from learning good representations, as similarly reported in [35] (which avoids using BN). The model appears to “cheat” the pretext task and easily finds a low-loss solution. This is possibly because the intra-batch communication among samples (caused by BN) leaks information. We resolve this problem by shuffling BN. We train with multiple GPUs and perform BN on the samples independently for each GPU (as done in common practice). For the key encoder fk, we shuffle the sample order in the current mini-batch before distributing it among GPUs (and shuffle back after encoding); the sample order of the mini-batch for the query encoder fq is not altered. This ensures the batch statistics used to compute a query and its positive key come from two different subsets. This effectively tackles the cheating issue and allows training to benefit from BN. We use shuffled BN in both our method and its end-to-end ablation counterpart (Figure 2a). It is irrelevant to the memory bank counterpart (Figure 2b), which does not suffer from this issue because the positive keys are from different mini-batches in the past.
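For reference, this is my rough understanding of the shuffling step in code. The repo does this in a distributed setting (`_batch_shuffle_ddp` / `_batch_unshuffle_ddp` in `moco/builder.py`); the sketch below is only a single-process emulation where the batch is split into chunks that each run BN with their own statistics. Names like `encoder_k`, `im_k`, and `num_chunks` are placeholders, not the repo's API:

```python
import torch

def shuffled_key_forward(encoder_k, im_k, num_chunks=4):
    """Encode keys with a shuffled sample order so that a key's BN
    statistics come from a different subset of samples than its query's.
    Emulates per-GPU BN by running the encoder chunk by chunk."""
    batch_size = im_k.size(0)

    # Shuffle the sample order and remember how to undo it.
    idx_shuffle = torch.randperm(batch_size, device=im_k.device)
    idx_unshuffle = torch.argsort(idx_shuffle)

    # Each chunk stands in for one GPU: in train mode, its BN layers
    # normalize with statistics computed from that chunk only.
    chunks = im_k[idx_shuffle].chunk(num_chunks)
    k = torch.cat([encoder_k(c) for c in chunks], dim=0)

    # Shuffle back so queries and keys line up index by index again.
    return k[idx_unshuffle]
```

If I read the paper correctly, the point is that without this shuffle, a query and its positive key would share the same BN batch statistics, which gives the model a shortcut for matching them.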

solauky commented 1 year ago

Your message has been received, thank you!

EricJin2002 commented 1 year ago

I found an answer here and hope it helps:

[D] Shuffling Batch Normalization in MoCo - Self Supervised Learning Method : r/MachineLearning (reddit.com)