chainer / chainermn

ChainerMN: Scalable distributed deep learning with Chainer
https://chainer.org
MIT License

Distribution Efficiency is low on AWS GPU instances #131

Closed: sonots closed this issue 6 years ago

sonots commented 6 years ago

I've evaluated distribution efficiency of ChainerMN on AWS GPU Instances.

The article (in Japanese) is here: https://qiita.com/sonots/private/22384bbc61284f2fdf94

My evaluation showed that:

At first I thought this result was as expected, because ChainerMN recommends using InfiniBand while p2.16xlarge has only 20 Gbps of network bandwidth. However, ChainerMN used only 6.0 Gbps during my experiment, so the network was not the bottleneck.
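For reference, a rough way to sanity-check whether the network could even be the bottleneck is to estimate the allreduce traffic from the model's parameter count. Here is a minimal sketch; the `model` object, fp32 gradients, and the ring-allreduce factor are assumptions for illustration, not part of the original measurement:

import chainer

def allreduce_bytes_per_iteration(model, n_nodes, dtype_bytes=4):
    # Total number of gradient elements in the model.
    n_params = sum(p.size for p in model.params())
    grad_bytes = n_params * dtype_bytes
    # A ring-style allreduce sends roughly 2 * (N - 1) / N times the buffer per node.
    return grad_bytes * 2 * (n_nodes - 1) / n_nodes

# Example: a ~100M-parameter model on 2 nodes sends ~0.4 GB per node per iteration,
# so whether 20 Gbps is enough depends directly on the iteration time.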

I investigated further, and it seems that system (sy) CPU usage increases in the multi-node experiment.

%Cpu0  : 40.4 us, 59.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 31.4 us,  0.0 sy,  0.0 ni, 67.9 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu3  : 36.3 us, 61.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  2.6 si,  0.0 st
%Cpu4  : 96.0 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  3.6 si,  0.0 st
%Cpu9  : 99.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu10 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 : 27.5 us, 72.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 : 34.7 us, 65.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 : 32.5 us, 66.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu19 : 93.7 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  6.3 si,  0.0 st
%Cpu23 : 34.3 us, 65.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 : 30.8 us, 69.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu29 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu30 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu31 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu34 : 68.9 us,  0.0 sy,  0.0 ni, 31.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu40 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

So, my guess is that kernel system calls, probably handling network traffic, steal CPU time from the Chainer processes, so the processes cannot make good progress on their main work. But this is just my guess.
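As a side note, one way to record the user/system split while training runs is to sample per-CPU times from a small watcher script. This is a minimal sketch using psutil, not part of the original experiment:

import psutil

# Sample per-CPU user/system time once per second for a minute and print the
# busy CPUs, to see whether 'sy' time grows during multi-node runs.
for _ in range(60):
    per_cpu = psutil.cpu_times_percent(interval=1.0, percpu=True)
    for i, c in enumerate(per_cpu):
        if c.user + c.system > 50.0:
            print("cpu%02d  us=%5.1f  sy=%5.1f" % (i, c.user, c.system))
    print("---")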

Do you have any explanation for this, and any ideas for improving performance on AWS?

keisukefukuda commented 6 years ago

Hello, @sonots san. Thank you for the experiments and your effort.

There could be several reasons for the poor parallel efficiency, and I have a few questions about it.

Possible reasons:

Questions:

- What did iperf actually measure?
- Did you specify the number of processes in the mpiexec command?
- Which MPI implementation did you use?
- What happens if you run 1 process/machine on x8 instances?

Thanks!

sonots commented 6 years ago

What did iperf actually measure?

The result of iperf was as follows:

------------------------------------------------------------
Client connecting to sonots-p2-8xlarge-2, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  7] local 10.0.4.58 port 47890 connected with 10.0.4.102 port 5001
[  3] local 10.0.4.58 port 47882 connected with 10.0.4.102 port 5001
[  6] local 10.0.4.58 port 47888 connected with 10.0.4.102 port 5001
[  4] local 10.0.4.58 port 47884 connected with 10.0.4.102 port 5001
[  5] local 10.0.4.58 port 47886 connected with 10.0.4.102 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  1.50 GBytes  1.28 Gbits/sec
[  3]  0.0-10.0 sec  3.92 GBytes  3.37 Gbits/sec
[  6]  0.0-10.0 sec  3.92 GBytes  3.36 Gbits/sec
[  4]  0.0-10.0 sec  1.22 GBytes  1.05 Gbits/sec
[  5]  0.0-10.0 sec  1.20 GBytes  1.03 Gbits/sec
[SUM]  0.0-10.0 sec  11.8 GBytes  10.1 Gbits/sec

Did you specify the number of processes in the mpiexec command?

I specified the number of processes in the hostfile. Of course, I checked the number of running processes with the ps and top commands.
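For what it's worth, a quick way to double-check how the ranks are actually distributed across hosts is a tiny mpi4py script; the hostfile name and launch line below are hypothetical:

# rank_check.py -- print how MPI ranks map to hosts (sketch using mpi4py).
# Launched e.g. with: mpiexec -n 16 --hostfile hosts python rank_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank %d / %d on %s" % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))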

Which MPI implementation did you use?

Open MPI v3.0.0

What happens if you run 1 process/machine on x8 instances?

The result was the same as with p2.xlarge.
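For context, the usual ChainerMN multi-node layout is one MPI process per GPU, with each process picking its device from the communicator's intra-node rank. A minimal sketch; the communicator name and the placeholder model are assumptions, not the code used in this experiment:

import chainer
import chainermn

comm = chainermn.create_communicator('hierarchical')  # or 'naive', 'pure_nccl', ...
device = comm.intra_rank                              # one GPU per process on each node
chainer.cuda.get_device_from_id(device).use()

model = chainer.links.Linear(1000, 1000)               # placeholder model for the sketch
model.to_gpu()

optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.MomentumSGD(), comm)
optimizer.setup(model)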

sonots commented 6 years ago

I will re-evaluate with ImageNet anyway.

keisukefukuda commented 6 years ago

Thanks! (And after you finish the new experiment, please re-open this issue or create a new one.)

sonots commented 6 years ago

I've re-evaluated, and the results now look reasonable. https://qiita.com/sonots/items/22384bbc61284f2fdf94#%E3%81%BE%E3%81%A8%E3%82%81

keisukefukuda commented 6 years ago

Thanks!