chainer / chainermn

ChainerMN: Scalable distributed deep learning with Chainer
https://chainer.org
MIT License

ChainerMN's ImageNet example is slower than Chainer's data parallel #145

Open LWisteria opened 6 years ago

LWisteria commented 6 years ago

You might already know this: I recently tried ChainerMN on Sakura Koukaryoku Computing.

I measured training throughput with the ImageNet example, comparing ChainerMN's train_imagenet.py against Chainer's train_imagenet_data_parallel.py.

// Chainer
$ python train_imagenet_data_parallel.py /opt/traindata/ILSVRC2012/train.ssv /opt/traindata/ILSVRC2012/val.ssv -a resnet50

// ChainerMN
$ mpiexec -n 4 python train_imagenet.py /opt/traindata/ILSVRC2012/train.ssv /opt/traindata/ILSVRC2012/val.ssv -a resnet50
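For reference, throughput for runs like the two above can be computed from Chainer's LogReport output (a JSON list of entries that include `iteration` and `elapsed_time`). This is a minimal sketch, assuming that log format and a hypothetical batch size of 32; the entry values below are synthetic, not measured:

```python
import json

def throughput_from_log(log_entries, batchsize):
    """Images/sec between the first and last log entries.

    Assumes Chainer's LogReport format: a JSON list of dicts,
    each containing 'iteration' and 'elapsed_time' keys.
    """
    first, last = log_entries[0], log_entries[-1]
    iterations = last['iteration'] - first['iteration']
    seconds = last['elapsed_time'] - first['elapsed_time']
    return iterations * batchsize / seconds

# Synthetic example entries; a real run would use
# entries = json.load(open('result/log')):
entries = [{'iteration': 100, 'elapsed_time': 10.0},
           {'iteration': 600, 'elapsed_time': 60.0}]
print(throughput_from_log(entries, batchsize=32))  # 320.0 images/sec
```

Note that for ChainerMN the per-process batch size multiplies by the number of MPI processes, so global throughput should be compared, not per-worker throughput.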

Other detailed environment settings are described in my blog post (sorry, it is in Japanese).

The result showed that ChainerMN's example was slower than Chainer's (see the attached result chart).

What could cause this, and how can I improve ChainerMN's performance?

Please ask if you have any questions, and let me know if you need the same ImageNet images to reproduce this problem.

iwiwi commented 6 years ago

Thank you for reporting this! I personally don't think this is generally the case; for example, in our recent experiments (https://arxiv.org/abs/1711.04325), our throughput on ResNet-50 with ChainerMN was close to state-of-the-art in comparison with other efficient frameworks such as Caffe2. I assume your result is due to the environment or configuration. Anyway, @shu65 will investigate it soon.
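One way to make "slower than expected" concrete when comparing a single-process run against an MPI run is parallel scaling efficiency: the measured speedup divided by the ideal speedup. A minimal sketch with hypothetical throughput numbers (not taken from either experiment above):

```python
def scaling_efficiency(baseline_throughput, parallel_throughput, num_workers):
    """Fraction of ideal linear speedup achieved.

    1.0 means perfect scaling; below 1/num_workers means the
    parallel run is outright slower than the baseline.
    """
    speedup = parallel_throughput / baseline_throughput
    return speedup / num_workers

# Hypothetical numbers: 100 images/sec on one worker,
# 300 images/sec on 4 MPI processes -> 75% efficiency.
print(scaling_efficiency(100.0, 300.0, num_workers=4))  # 0.75
```

An efficiency well below 1.0 usually points at communication overhead or I/O contention rather than compute, which is consistent with the environment/configuration hypothesis.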

hgjung3 commented 6 years ago

@iwiwi Has this problem been solved?