Closed fengyuan14 closed 6 years ago
@arthuryuan1987, thank you for raising this issue, and thank you for your contribution!
It's great to hear that you implemented a layer-wise computation-communication overlap feature and obtained a significant performance improvement.
Actually, the feature is in our scope, but we have no concrete short-term plan. This is because the overlapping feature is not as straightforward as it looks.
if statement). The current ChainerMN pads zeros for all unevaluated layers and works well.
As you may know, ChainerMN already has a "double-buffering" feature. Although it comes at the cost of some accuracy degradation, it is general and can work with any network.
In fact, ChainerMN has been integrated into Chainer (see https://github.com/chainer/chainer/pull/5226), and will be more tightly integrated with Chainer. There may be more opportunities to support such corner cases with more advanced Chainer features, and we will start considering the feature.
Thanks,
Thanks for your detailed explanation. We will take a deeper dive to get a more complete view. Thanks a lot.
I'm closing the issue for now. Feel free to reopen it if you would like to discuss this further. Thanks!
Hello, we ran some experiments with a non-blocking methodology on ChainerMN. The methodology is simply as follows:
<Iter-(n)>
(Layer-1) wait for async-allreduce from last iter ->
(Layer-1) forward computation ->
(Layer-2) wait for async-allreduce from last iter ->
(Layer-2) forward computation ->
... ... ->
(Layer-2) backward computation ->
(Layer-2) send async-allreduce request ->
(Layer-1) backward computation ->
(Layer-1) send async-allreduce request ->
<Iter-(n+1)>
(Layer-1) wait for async-allreduce from last iter ->
(Layer-1) forward computation ->
(Layer-2) wait for async-allreduce from last iter ->
(Layer-2) forward computation ->
...
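The schedule above can be sketched in plain Python. This is only an illustrative simulation, not the patch discussed in this issue and not ChainerMN's API: a single background thread stands in for a non-blocking allreduce, and `Layer`, `fake_allreduce`, `train`, and the fixed per-worker gradients are all hypothetical names and values chosen for the example. Each layer waits for its own reduction just before its next forward pass, so the reduction of the later layers overlaps with the backward computation of the earlier ones.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_allreduce(grads_per_worker):
    """Hypothetical stand-in for an allreduce: average one layer's
    gradient over all (simulated) workers."""
    return sum(grads_per_worker) / len(grads_per_worker)


class Layer:
    """Toy layer holding a scalar weight and its in-flight reduction."""

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.pending = None  # future of the async allreduce, if any


def apply_pending(layer, lr):
    """Block until the layer's allreduce finishes, then update its weight."""
    averaged = layer.pending.result()
    layer.weight -= lr * averaged
    layer.pending = None


def train(layers, num_iters, local_grads, lr=0.1):
    """Run the non-blocking schedule and return a log of events.

    local_grads maps a layer name to a list of per-worker gradients
    (fixed values here; a real model would compute them in backward).
    """
    log = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        for it in range(num_iters):
            # Forward: wait for the previous iteration's allreduce per layer.
            for layer in layers:
                if layer.pending is not None:
                    log.append(("wait", layer.name, it))
                    apply_pending(layer, lr)
                log.append(("forward", layer.name, it))
            # Backward: send the allreduce as soon as a layer's gradient exists.
            for layer in reversed(layers):
                log.append(("backward", layer.name, it))
                layer.pending = comm.submit(fake_allreduce, local_grads[layer.name])
                log.append(("send", layer.name, it))
        # Drain the reductions still in flight after the last iteration.
        for layer in layers:
            if layer.pending is not None:
                apply_pending(layer, lr)
    return log


if __name__ == "__main__":
    net = [Layer("layer1", 1.0), Layer("layer2", 1.0)]
    grads = {"layer1": [1.0, 3.0], "layer2": [2.0, 4.0]}
    for event in train(net, 2, grads):
        print(event)
```

Because every layer still applies its averaged gradient before it is used in the next forward pass, this sketch does not introduce stale gradients; it only moves the wait from the end of backward to the start of each layer's forward.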
Compared with the blocking approach (the existing methodology), the non-blocking approach gave a significant improvement. Here is the data for ResNet50. Test environment: 16 nodes (Intel skx-8180), batch size 128, 10GB bandwidth, IMPI. Blocking scalability is 66.72%; non-blocking scalability is 92.4%. Scalability calculation: iterations-per-sec-on-MultiNode / iterations-per-sec-on-SingleNode.
Do you have any plans to implement the non-blocking approach? Alternatively, we could share a patch and discuss it further.
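For readers who want to reproduce the arithmetic, here is a small sketch. The ratio in the post is implemented as written; the extra division by the node count in `scaling_efficiency` is an assumption about how a percentage below 100% would be obtained on 16 nodes, and is not stated in the post. All function names are hypothetical.

```python
def speedup(multi_node_ips, single_node_ips):
    """Ratio from the post: multi-node over single-node iterations per second."""
    return multi_node_ips / single_node_ips


def scaling_efficiency(multi_node_ips, single_node_ips, num_nodes):
    """ASSUMPTION: speedup normalized by node count, as a fraction of ideal."""
    return speedup(multi_node_ips, single_node_ips) / num_nodes


def relative_gain(nonblocking_pct, blocking_pct):
    """Throughput ratio implied by two scalability figures that share the
    same node count and the same single-node baseline."""
    return nonblocking_pct / blocking_pct


if __name__ == "__main__":
    # Figures quoted in the post: 92.4% non-blocking vs. 66.72% blocking.
    print(relative_gain(92.4, 66.72))
```

Under that reading, the two reported figures imply that the non-blocking run completes roughly 1.38 times as many iterations per second as the blocking run on the same 16 nodes.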