Status: closed (by zhang0jhon, 4 years ago)
The latest version of the arXiv paper has the ablation curves for shuffle BN. Broadcast/AllGather happens only twice: on the input data and on the output features. It is not layer-wise, so it is very affordable. The shuffling on the data could instead be implemented by a cleverer data loader, in which case it would be essentially free. The shuffling on the features is very lightweight, as the feature size is small (128-d).
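A minimal sketch of the two communication points described above (the exact function names here are illustrative, not necessarily the repo's): an all-gather plus a broadcast permutation to shuffle the key batch across GPUs, and a second all-gather to unshuffle the small 128-d output features.

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def concat_all_gather(tensor):
    """Gather a tensor from every GPU and concatenate along dim 0 (no grad)."""
    gathered = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    return torch.cat(gathered, dim=0)

@torch.no_grad()
def batch_shuffle_ddp(x):
    """Shuffle the batch across GPUs before the key encoder, so each GPU's
    BN statistics are computed on a random subset of the global batch."""
    batch_size_this = x.shape[0]
    x_gather = concat_all_gather(x)              # full global batch on every GPU
    num_gpus = x_gather.shape[0] // batch_size_this

    # one random permutation, created on rank 0 and broadcast to all ranks
    idx_shuffle = torch.randperm(x_gather.shape[0], device=x.device)
    dist.broadcast(idx_shuffle, src=0)
    idx_unshuffle = torch.argsort(idx_shuffle)   # needed to restore order later

    idx_this = idx_shuffle.view(num_gpus, -1)[dist.get_rank()]
    return x_gather[idx_this], idx_unshuffle

@torch.no_grad()
def batch_unshuffle_ddp(x, idx_unshuffle):
    """Undo batch_shuffle_ddp on the (small, e.g. 128-d) output features."""
    batch_size_this = x.shape[0]
    x_gather = concat_all_gather(x)
    num_gpus = x_gather.shape[0] // batch_size_this

    idx_this = idx_unshuffle.view(num_gpus, -1)[dist.get_rank()]
    return x_gather[idx_this]
```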
Hi, what if we use sync-BN? Could we use sync-BN instead of shuffle-BN?
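For reference, trying sync-BN would amount to converting the encoders' BN layers before wrapping them in DistributedDataParallel; a minimal sketch (the ResNet-50 encoders with a 128-d head are just an illustrative setup, not taken from this thread):

```python
import torch.nn as nn
import torchvision.models as models

# Illustrative query/key encoders (ResNet-50 with a 128-d output head).
encoder_q = models.resnet50(num_classes=128)
encoder_k = models.resnet50(num_classes=128)

# Replace every BatchNorm layer with SyncBatchNorm so statistics are computed
# over the global batch rather than per GPU. The idea behind the question is
# that globally synchronized statistics leave no per-GPU signal for the key
# encoder to exploit, at the cost of communication at every BN layer.
encoder_q = nn.SyncBatchNorm.convert_sync_batchnorm(encoder_q)
encoder_k = nn.SyncBatchNorm.convert_sync_batchnorm(encoder_k)
```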
Awesome work! In my opinion, shuffle BN is proposed to keep the running mean and variance of encoder q and encoder k different, which prevents the encoder parameters from falling into a local optimum. How do you evaluate the benefits of shuffle BN? Moreover, distributed training of MoCo suffers from the time-consuming broadcast and allgather operations in shuffle BN. Do you have any suggestions for accelerating distributed training with shuffle BN?
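On the cost concern, a rough micro-benchmark (hypothetical; it assumes `torch.distributed` is already initialized with a CUDA-capable backend) can show why gathering the 128-d output features is much cheaper than gathering the input images:

```python
import time
import torch
import torch.distributed as dist

def time_all_gather(tensor, iters=50):
    """Rough per-iteration timing of all_gather for one tensor size."""
    bucket = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(bucket, tensor)          # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dist.all_gather(bucket, tensor)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

# Per-GPU batch of 32: full images (~19 MB) vs. 128-d features (~16 KB).
images = torch.randn(32, 3, 224, 224, device="cuda")
features = torch.randn(32, 128, device="cuda")
print("images  :", time_all_gather(images))
print("features:", time_all_gather(features))
```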