Non-Blocking Methodology on ChainerMN

fengyuan14 commented 6 years ago

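Hello, we ran some experiments with a non-blocking methodology on ChainerMN. Each layer's gradient allreduce is issued asynchronously as soon as that layer's backward computation finishes, and the next iteration waits for it only right before that layer's forward computation:

<Iter-(n)> forward: (Layer-1) wait for the async allreduce from the last iteration -> (Layer-1) forward computation -> (Layer-2) wait for the async allreduce from the last iteration -> (Layer-2) forward computation -> ...
<Iter-(n)> backward: ... -> (Layer-2) backward computation -> (Layer-2) send async allreduce request -> (Layer-1) backward computation -> (Layer-1) send async allreduce request
<Iter-(n+1)> forward: (Layer-1) wait for the async allreduce from the last iteration -> (Layer-1) forward computation -> (Layer-2) wait for the async allreduce from the last iteration -> (Layer-2) forward computation -> ...

This way, each layer's communication overlaps with the backward computation of the earlier layers and with the forward computation of the next iteration up to that layer.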
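To make the schedule described above concrete, here is a minimal single-process sketch. It is not ChainerMN code: `fake_allreduce` is a hypothetical stand-in that averages fixed per-worker gradients, it runs on a background thread via `concurrent.futures` to play the role of a non-blocking allreduce, and the forward/backward computations are only logged. The names `fake_allreduce` and `train` are illustrative.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Tuple


def fake_allreduce(grads_per_worker: List[float]) -> float:
    # Hypothetical stand-in for an MPI allreduce: average one layer's
    # gradient over all (simulated) workers.
    return sum(grads_per_worker) / len(grads_per_worker)


def train(num_layers: int, num_iters: int, grads: List[List[float]],
          lr: float = 0.1) -> Tuple[List[float], List[str]]:
    """grads[w][l] is the (fixed, illustrative) gradient worker w computes for layer l."""
    params = [1.0] * num_layers
    pending: Dict[int, Future] = {}  # layer index -> in-flight allreduce
    log: List[str] = []
    with ThreadPoolExecutor(max_workers=1) as comm:  # background "communication" thread
        for it in range(num_iters):
            # Forward: layer l blocks only on its own allreduce from the previous
            # iteration, applies the averaged gradient, then computes.
            for l in range(num_layers):
                if l in pending:
                    params[l] -= lr * pending.pop(l).result()
                    log.append(f"iter{it} L{l + 1} wait")
                log.append(f"iter{it} L{l + 1} fwd")
            # Backward: right after a layer's backward, launch its allreduce
            # without waiting, so it overlaps with the remaining backward work.
            for l in reversed(range(num_layers)):
                log.append(f"iter{it} L{l + 1} bwd")
                pending[l] = comm.submit(fake_allreduce, [g[l] for g in grads])
                log.append(f"iter{it} L{l + 1} send")
        # Drain the requests issued by the final iteration.
        for l, fut in pending.items():
            params[l] -= lr * fut.result()
    return params, log


params, log = train(num_layers=2, num_iters=2, grads=[[1.0, 3.0], [3.0, 5.0]])
print(log)
print(params)
```

In the first iteration there is nothing to wait for; from the second iteration on, the log shows the interleaving from the issue: Layer-1 waits and computes before Layer-2 waits, and the allreduce requests are sent in reverse layer order during the backward pass.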

Compared with the blocking methodology (the existing one), the non-blocking methodology gives a significant improvement. Here is data for ResNet-50. Test environment: 16 nodes (Intel skx-8180), batch size 128, 10GB bandwidth, Intel MPI (IMPI). Blocking scalability is 66.72%; non-blocking scalability is 92.4%. Scalability is calculated as (iterations per second on multiple nodes) / (iterations per second on a single node).
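The scalability definition above can be written out as a small calculation. The iteration rates below are hypothetical, chosen only so that the result matches the reported 92.4%. The `aggregate_speedup` helper rests on an extra assumption not stated in the issue: that the batch size of 128 is per node (weak scaling), so each multi-node iteration processes `num_nodes` times as many samples.

```python
def scalability(iters_per_sec_multi: float, iters_per_sec_single: float) -> float:
    # Scalability as defined in the issue: multi-node iteration rate divided
    # by single-node iteration rate.
    return iters_per_sec_multi / iters_per_sec_single


def aggregate_speedup(scal: float, num_nodes: int) -> float:
    # Assumption: batch size is per node (weak scaling), so sample throughput
    # grows by num_nodes times the per-iteration scalability.
    return scal * num_nodes


# Hypothetical iteration rates, picked only to reproduce 92.4%.
s = scalability(iters_per_sec_multi=1.848, iters_per_sec_single=2.0)
print(f"scalability = {s:.1%}")
print(f"aggregate speedup on 16 nodes = {aggregate_speedup(s, 16):.2f}x")
```

Under that per-node-batch assumption, the reported 66.72% and 92.4% would correspond to roughly 10.7x and 14.8x sample throughput on 16 nodes.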

Do you have any plans to implement the non-blocking methodology? Or could we submit a patch and discuss it further?

keisukefukuda commented 6 years ago

@arthuryuan1987, thank you for raising this issue and for your contribution!

It's great to hear that you implemented a layer-wise computation-communication overlap feature and obtained a significant performance improvement.

Actually, the feature is within our scope, but we have no concrete short-term plan, because overlapping is not as straightforward as it looks:

What should the API look like?
How should it work with recurrent networks such as LSTMs?
How should it treat a dynamic network, i.e., one in which a layer may not be computed depending on a runtime condition (an if statement)? The current ChainerMN pads zeros for all unevaluated layers and works well.
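The zero-padding approach mentioned above can be sketched as follows. This is an illustration under stated assumptions, not ChainerMN's implementation: `LAYER_SIZES`, `pack_grads`, and `allreduce_mean` are hypothetical names, and the allreduce is simulated by averaging plain Python lists.

```python
from typing import Dict, List, Optional

# Hypothetical layer sizes shared by every worker.
LAYER_SIZES = {"conv": 3, "branch": 2, "fc": 1}


def pack_grads(grads: Dict[str, Optional[List[float]]]) -> List[float]:
    # Concatenate gradients in a fixed layer order. A layer this worker did not
    # evaluate (grad is None) is filled with zeros, so every worker contributes
    # a buffer of the same length and layout to the allreduce.
    buf: List[float] = []
    for name, size in LAYER_SIZES.items():
        g = grads.get(name)
        buf.extend(g if g is not None else [0.0] * size)
    return buf


def allreduce_mean(buffers: List[List[float]]) -> List[float]:
    # Stand-in for an MPI allreduce followed by division by the worker count.
    n = len(buffers)
    return [sum(vals) / n for vals in zip(*buffers)]


worker0 = {"conv": [1.0, 1.0, 1.0], "branch": [4.0, 2.0], "fc": [2.0]}
worker1 = {"conv": [3.0, 3.0, 3.0], "branch": None, "fc": [4.0]}  # if-branch skipped
avg = allreduce_mean([pack_grads(worker0), pack_grads(worker1)])
print(avg)  # → [2.0, 2.0, 2.0, 2.0, 1.0, 3.0]
```

Because the padding happens after the whole backward pass, every worker always issues the same single reduction. A layer-wise scheme has no such point: a worker that skips a layer never reaches the step where it would send that layer's request, which is presumably why dynamic networks complicate the overlap.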

As you may know, ChainerMN already has a "double-buffering" feature. Although it comes at the cost of some accuracy degradation, it is general and works with any network.
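The accuracy trade-off of double buffering can be illustrated with a toy model. This is not ChainerMN's implementation; it assumes, as a simplification, that the whole gradient computed in one step is reduced in the background during the next step and applied only then, so every update is one step stale. The loss f(w) = 0.5 * w**2 and the function names are hypothetical.

```python
def grad(w: float) -> float:
    # Gradient of the toy loss f(w) = 0.5 * w**2.
    return w


def sync_sgd(w: float, lr: float, steps: int) -> float:
    # Blocking: each step waits for the current gradient before updating.
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def double_buffered_sgd(w: float, lr: float, steps: int) -> float:
    # Simplified double buffering: the gradient from step t is applied in
    # step t + 1. Step 0 has nothing to apply yet.
    prev = None
    for _ in range(steps):
        g = grad(w)             # forward/backward with the current weights
        if prev is not None:
            w -= lr * prev      # apply last step's (now reduced) gradient
        prev = g                # hand this gradient to the background reduction
    return w


print(sync_sgd(1.0, 0.5, 3))             # → 0.125
print(double_buffered_sgd(1.0, 0.5, 3))  # → 0.0
```

Since the entire gradient is delayed as one unit, nothing depends on layer order, recurrence, or which layers ran, which is what makes the approach general. The price is staleness: one more step of the double-buffered run overshoots the minimum to -0.25, while the synchronous run keeps shrinking toward 0.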

In addition, ChainerMN has been integrated into Chainer (see https://github.com/chainer/chainer/pull/5226) and will be integrated more tightly with it. More advanced Chainer features may give us more opportunities to support such corner cases, and we will start considering this feature.

Thanks,

fengyuan14 commented 6 years ago

Thanks for your elaborate explanation. We will take a deep dive to get a complete view. Thanks a lot.

keisukefukuda commented 6 years ago

I'm closing the issue for now. Feel free to reopen it if you would like to discuss this further. Thanks!

chainer/chainermn issue #291