chainer / chainermn

ChainerMN: Scalable distributed deep learning with Chainer
https://chainer.org
MIT License

Port Chainer#4191 or use Chainer's BN implementation #203

Open kuenishi opened 6 years ago

kuenishi commented 6 years ago

ChainerMN's BatchNormalization code is mostly copied from Chainer's (with several AllReduce calls added), which means bugs in Chainer's implementation may have been carried over as well. https://github.com/chainer/chainer/pull/4191 may address one of them. Porting it to ChainerMN is the obvious option, but there is another major choice: reworking ChainerMN's BN so that it builds more cleanly on Chainer's BN code and gets upstream fixes as a free lunch. Thoughts?
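For readers unfamiliar with why the multi-node version needs extra AllReduce calls, here is a minimal sketch of the idea, not ChainerMN's actual code. It assumes each worker holds a local mini-batch of scalars and that the batch statistics are obtained by summing per-worker partial sums. `simulated_allreduce_sum`, `local_statistics`, and `synchronized_mean_var` are hypothetical names invented for this illustration, and the allreduce is simulated in-process rather than done over MPI.

```python
from statistics import mean, pvariance


def simulated_allreduce_sum(values_per_worker):
    """Hypothetical stand-in for an allreduce: return the elementwise sum
    of every worker's vector (the result each worker would receive)."""
    return [sum(column) for column in zip(*values_per_worker)]


def local_statistics(batch):
    """Partial sums one worker contributes: [count, sum, sum of squares]."""
    return [float(len(batch)), sum(batch), sum(x * x for x in batch)]


def synchronized_mean_var(batches):
    """Mean and biased variance over all workers' samples combined."""
    count, total, total_sq = simulated_allreduce_sum(
        [local_statistics(b) for b in batches]
    )
    batch_mean = total / count
    # Biased variance via E[x^2] - E[x]^2.
    batch_var = total_sq / count - batch_mean * batch_mean
    return batch_mean, batch_var


# Two simulated workers whose local batches have very different means.
worker_batches = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
global_mean, global_var = synchronized_mean_var(worker_batches)

# Reference: statistics of the concatenated batch.
flat = [x for b in worker_batches for x in b]
local_means = [mean(b) for b in worker_batches]
```

In this toy setup the per-worker means differ from the mean of the combined batch, which is the gap the added AllReduce calls are meant to close. Any bug in the shared normalization logic itself would affect both the single-node and multi-node paths.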

mingxiaoh commented 6 years ago

@kuenishi We have the same question. We recently ran multi-node experiments and found that googlenet_v2, googlenet_v3, and resnet50 show unexpectedly low (<10%) validation accuracy, while alexnet and googlenet reach SOTA accuracy. We suspect there may be a bug in the batch normalization implementation. FYI, the networks above reach the same accuracy as SOTA/GPU on a single node.

kuenishi commented 6 years ago

@mingxiaoh Thank you for reporting. Would you have a chance to try porting https://github.com/chainer/chainer/pull/4191 to ChainerMN's BN code to verify whether it's a bug?