chainer / chainermn

ChainerMN: Scalable distributed deep learning with Chainer
https://chainer.org
MIT License

Distribution Efficiency is low on AWS GPU instances #131

Closed: sonots closed this issue 6 years ago

sonots commented 6 years ago

I've evaluated distribution efficiency of ChainerMN on AWS GPU Instances.

The article (in Japanese) is here: https://qiita.com/sonots/private/22384bbc61284f2fdf94

My evaluation showed that:

At first I thought this result was as expected, because ChainerMN recommends using InfiniBand while p2.16xlarge has only 20 Gbps of network bandwidth. However, ChainerMN used only 6.0 Gbps during my experiment, so the network was not the bottleneck.
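For reference, a rough way to sanity-check whether the network could even be the bottleneck is to estimate the allreduce traffic from the model's parameter count. Here is a minimal sketch; the `model` object, fp32 gradients, and the ring-allreduce factor are assumptions for illustration, not part of the original measurement:

import chainer

def allreduce_bytes_per_iteration(model, n_nodes, dtype_bytes=4):
    # Total number of gradient elements in the model.
    n_params = sum(p.size for p in model.params())
    grad_bytes = n_params * dtype_bytes
    # A ring-style allreduce sends roughly 2 * (N - 1) / N times the buffer per node.
    return grad_bytes * 2 * (n_nodes - 1) / n_nodes

# Example: a ~100M-parameter model on 2 nodes sends ~0.4 GB per node per iteration,
# so whether 20 Gbps is enough depends directly on the iteration time.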

I investigated further, and it seems that system (sy) CPU usage increases in the multi-node experiment.

%Cpu0  : 40.4 us, 59.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 31.4 us,  0.0 sy,  0.0 ni, 67.9 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu3  : 36.3 us, 61.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  2.6 si,  0.0 st
%Cpu4  : 96.0 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  3.6 si,  0.0 st
%Cpu9  : 99.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu10 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 : 27.5 us, 72.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 : 34.7 us, 65.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 : 32.5 us, 66.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu19 : 93.7 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  6.3 si,  0.0 st
%Cpu23 : 34.3 us, 65.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 : 30.8 us, 69.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu29 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu30 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu31 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu34 : 68.9 us,  0.0 sy,  0.0 ni, 31.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu40 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

So, my guess is that kernel system calls, probably handling network traffic, steal CPU time from the Chainer processes, so the processes cannot make good progress on their main work. But this is just my guess.
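As a side note, one way to record the user/system split while training runs is to sample per-CPU times from a small watcher script. This is a minimal sketch using psutil, not part of the original experiment:

import psutil

# Sample per-CPU user/system time once per second for a minute and print the
# busy CPUs, to see whether 'sy' time grows during multi-node runs.
for _ in range(60):
    per_cpu = psutil.cpu_times_percent(interval=1.0, percpu=True)
    for i, c in enumerate(per_cpu):
        if c.user + c.system > 50.0:
            print("cpu%02d  us=%5.1f  sy=%5.1f" % (i, c.user, c.system))
    print("---")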

Do you have any explanation for this, and any ideas for improving performance on AWS?

keisukefukuda commented 6 years ago

Hello, @sonots san. Thank you for the experiments and your effort.

There could be several reasons for the poor parallel efficiency, and I have a few questions about it.

Possible reasons:

Questions:

- What did iperf actually measure?
- Did you specify the number of processes in the mpiexec command?
- Which MPI implementation did you use?
- What happens if you run 1 process/machine on x8 instances?

Thanks!

sonots commented 6 years ago

What did iperf actually measure?

The result of iperf was as follows:

------------------------------------------------------------
Client connecting to sonots-p2-8xlarge-2, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  7] local 10.0.4.58 port 47890 connected with 10.0.4.102 port 5001
[  3] local 10.0.4.58 port 47882 connected with 10.0.4.102 port 5001
[  6] local 10.0.4.58 port 47888 connected with 10.0.4.102 port 5001
[  4] local 10.0.4.58 port 47884 connected with 10.0.4.102 port 5001
[  5] local 10.0.4.58 port 47886 connected with 10.0.4.102 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  1.50 GBytes  1.28 Gbits/sec
[  3]  0.0-10.0 sec  3.92 GBytes  3.37 Gbits/sec
[  6]  0.0-10.0 sec  3.92 GBytes  3.36 Gbits/sec
[  4]  0.0-10.0 sec  1.22 GBytes  1.05 Gbits/sec
[  5]  0.0-10.0 sec  1.20 GBytes  1.03 Gbits/sec
[SUM]  0.0-10.0 sec  11.8 GBytes  10.1 Gbits/sec

Did you specify the number of processes in the mpiexec command?

I specified the number of processes in the hostfile. Of course, I checked the number of running processes with the ps and top commands.
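For what it's worth, a quick way to double-check how the ranks are actually distributed across hosts is a tiny mpi4py script; the hostfile name and launch line below are hypothetical:

# rank_check.py -- print how MPI ranks map to hosts (sketch using mpi4py).
# Launched e.g. with: mpiexec -n 16 --hostfile hosts python rank_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank %d / %d on %s" % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))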

Which MPI implementation did you use?

Open MPI v3.0.0

What happens if you run 1 process/machine on x8 instances?

The result was the same as with p2.xlarge.
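For context, the usual ChainerMN multi-node layout is one MPI process per GPU, with each process picking its device from the communicator's intra-node rank. A minimal sketch; the communicator name and the placeholder model are assumptions, not the code used in this experiment:

import chainer
import chainermn

comm = chainermn.create_communicator('hierarchical')  # or 'naive', 'pure_nccl', ...
device = comm.intra_rank                              # one GPU per process on each node
chainer.cuda.get_device_from_id(device).use()

model = chainer.links.Linear(1000, 1000)               # placeholder model for the sketch
model.to_gpu()

optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.MomentumSGD(), comm)
optimizer.setup(model)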

sonots commented 6 years ago

I will re-evaluate with ImageNet anyway.

keisukefukuda commented 6 years ago

Thanks! (And after you finish the new experiment, please re-open this issue or create a new one.)

sonots commented 6 years ago

I've re-evaluated, and the results now look reasonable. https://qiita.com/sonots/items/22384bbc61284f2fdf94#%E3%81%BE%E3%81%A8%E3%82%81

keisukefukuda commented 6 years ago

Thanks!