Status: closed (by zhang0jhon, 4 years ago)
The latest version of the arXiv paper has the ablation curves for shuffle BN. Broadcast/AllGather happens only twice: on the input data and on the output features. It is not layer-wise, so it is very affordable. The shuffling on the data could instead be implemented by a cleverer data loader, in which case it would be essentially free. The shuffling on the features is very lightweight, as the feature size is small (128-d).
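A minimal sketch of the two communication points described above (the exact function names here are illustrative, not necessarily the repo's): an all-gather plus a broadcast permutation to shuffle the key batch across GPUs, and a second all-gather to unshuffle the small 128-d output features.

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def concat_all_gather(tensor):
    """Gather a tensor from every GPU and concatenate along dim 0 (no grad)."""
    gathered = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    return torch.cat(gathered, dim=0)

@torch.no_grad()
def batch_shuffle_ddp(x):
    """Shuffle the batch across GPUs before the key encoder, so each GPU's
    BN statistics are computed on a random subset of the global batch."""
    batch_size_this = x.shape[0]
    x_gather = concat_all_gather(x)              # full global batch on every GPU
    num_gpus = x_gather.shape[0] // batch_size_this

    # one random permutation, created on rank 0 and broadcast to all ranks
    idx_shuffle = torch.randperm(x_gather.shape[0], device=x.device)
    dist.broadcast(idx_shuffle, src=0)
    idx_unshuffle = torch.argsort(idx_shuffle)   # needed to restore order later

    idx_this = idx_shuffle.view(num_gpus, -1)[dist.get_rank()]
    return x_gather[idx_this], idx_unshuffle

@torch.no_grad()
def batch_unshuffle_ddp(x, idx_unshuffle):
    """Undo batch_shuffle_ddp on the (small, e.g. 128-d) output features."""
    batch_size_this = x.shape[0]
    x_gather = concat_all_gather(x)
    num_gpus = x_gather.shape[0] // batch_size_this

    idx_this = idx_unshuffle.view(num_gpus, -1)[dist.get_rank()]
    return x_gather[idx_this]
```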
Hi, what if we use sync-BN? Could we use sync-BN instead of shuffle-BN?
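For reference, trying sync-BN would amount to converting the encoders' BN layers before wrapping them in DistributedDataParallel; a minimal sketch (the ResNet-50 encoders with a 128-d head are just an illustrative setup, not taken from this thread):

```python
import torch.nn as nn
import torchvision.models as models

# Illustrative query/key encoders (ResNet-50 with a 128-d output head).
encoder_q = models.resnet50(num_classes=128)
encoder_k = models.resnet50(num_classes=128)

# Replace every BatchNorm layer with SyncBatchNorm so statistics are computed
# over the global batch rather than per GPU. The idea behind the question is
# that globally synchronized statistics leave no per-GPU signal for the key
# encoder to exploit, at the cost of communication at every BN layer.
encoder_q = nn.SyncBatchNorm.convert_sync_batchnorm(encoder_q)
encoder_k = nn.SyncBatchNorm.convert_sync_batchnorm(encoder_k)
```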
Awesome work! In my opinion, shuffle BN is proposed to keep the running mean and variance of encoder q and encoder k different, which prevents the encoder parameters from falling into a local optimum. How do you evaluate the benefits of shuffle BN? Moreover, distributed training of MoCo suffers from the time-consuming broadcast and allgather operations in shuffle BN. Do you have any suggestions for accelerating distributed training with shuffle BN?
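On the cost concern, a rough micro-benchmark (hypothetical; it assumes `torch.distributed` is already initialized with a CUDA-capable backend) can show why gathering the 128-d output features is much cheaper than gathering the input images:

```python
import time
import torch
import torch.distributed as dist

def time_all_gather(tensor, iters=50):
    """Rough per-iteration timing of all_gather for one tensor size."""
    bucket = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(bucket, tensor)          # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dist.all_gather(bucket, tensor)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

# Per-GPU batch of 32: full images (~19 MB) vs. 128-d features (~16 KB).
images = torch.randn(32, 3, 224, 224, device="cuda")
features = torch.randn(32, 128, device="cuda")
print("images  :", time_all_gather(images))
print("features:", time_all_gather(features))
```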