Open kuenishi opened 6 years ago
@kuenishi We have the same question here. We recently ran a multi-node experiment and found that googlenet_v2, googlenet_v3, and resnet50 show unexpectedly low (<10%) validation accuracy, while alexnet and googlenet reach SOTA accuracy. We suspect there may be a bug in the batch normalization implementation. FYI, the networks above reach the same accuracy as SOTA/GPU on a single node.
@mingxiaoh Thank you for reporting. Would you be able to try porting https://github.com/chainer/chainer/pull/4191 to ChainerMN's BN code to verify whether this is a bug?
ChainerMN's BatchNormalization code is mostly copied from Chainer (with several AllReduce calls added), which means bugs in Chainer's implementation could have been carried over. https://github.com/chainer/chainer/pull/4191 may address one of them. Porting it to ChainerMN is the obvious option, but there is another major option: port Chainer's BN code more cleanly, so that ChainerMN gets a free lunch from future Chainer fixes. Thoughts?
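For anyone following along, here is a rough sketch of what "BN plus AllReduce" means in terms of the batch statistics. This is not ChainerMN's actual code: `simulated_allreduce_mean` and `multi_node_batch_stats` are hypothetical names, the AllReduce is faked with a plain average in a single process, and every worker is assumed to hold a shard of the same size.

```python
def simulated_allreduce_mean(values):
    """Hypothetical stand-in for an AllReduce that averages one scalar per worker."""
    return sum(values) / len(values)


def multi_node_batch_stats(shards):
    """Return the global (mean, variance) over all shards.

    Assumes equally sized shards, so averaging the per-worker moments
    gives the moments of the combined batch.
    """
    # Each worker computes E[x] and E[x^2] on its local shard only.
    local_means = [sum(s) / len(s) for s in shards]
    local_sq_means = [sum(x * x for x in s) / len(s) for s in shards]

    # The two moments are then averaged across workers.
    mean = simulated_allreduce_mean(local_means)
    sq_mean = simulated_allreduce_mean(local_sq_means)

    # Var[x] = E[x^2] - E[x]^2, computed from the global moments.
    var = sq_mean - mean * mean
    return mean, var


# Two simulated workers whose local data differ a lot.
shards = [[1.0, 2.0, 3.0, 4.0], [10.0, 11.0, 12.0, 13.0]]
mean, var = multi_node_batch_stats(shards)
```

In this toy case each worker's local variance is 1.25, while the variance of the combined batch is much larger, so mistakes in how the statistics are combined or used could plausibly show up only in multi-node runs.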