sonots closed this issue 6 years ago
Hello, @sonots san. Thank you for the experiments and your effort.
There could be several reasons for the poor parallel efficiency, and I have a few questions about it.
Possible reasons:
Questions:

1. What did `iperf` actually measure? I guess it measures an average bandwidth over 1 or 2 seconds by default. What are the values listed in the table? If it is the average value over a whole run, then it should be much lower than the actual bandwidth, because ChainerMN adopts synchronous parallelization.
2. Did you specify the number of processes in the `mpiexec` command?
3. Which MPI implementation did you use?
4. What happens if you run 1 process/machine on x8 instances?

Thanks!
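The averaging effect mentioned in question 1 can be sketched numerically: if the link is saturated only during the gradient all-reduce phase of each synchronous iteration and idle during compute, the run-wide average lands far below the peak. All numbers below are made up for illustration.

```shell
# Hypothetical timeline of one synchronous data-parallel iteration:
# the network is idle during compute and saturated during the all-reduce.
compute_s=0.3     # forward/backward time per iteration (assumed)
allreduce_s=0.1   # gradient-exchange time per iteration (assumed)
peak_gbps=10.0    # link bandwidth while the all-reduce runs (assumed)

# Bandwidth averaged over the whole iteration, which is roughly what a
# run-wide average (as opposed to a per-second peak) would report:
awk -v c="$compute_s" -v a="$allreduce_s" -v g="$peak_gbps" \
    'BEGIN { printf "peak %.1f Gbit/s, run-average %.1f Gbit/s\n", g, g * a / (c + a) }'
# prints: peak 10.0 Gbit/s, run-average 2.5 Gbit/s
```

So a run-wide average of, say, a quarter of the peak does not by itself mean the network was the bottleneck.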
What did `iperf` actually measure?
The output of `iperf` looked like this:
```
------------------------------------------------------------
Client connecting to sonots-p2-8xlarge-2, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[  7] local 10.0.4.58 port 47890 connected with 10.0.4.102 port 5001
[  3] local 10.0.4.58 port 47882 connected with 10.0.4.102 port 5001
[  6] local 10.0.4.58 port 47888 connected with 10.0.4.102 port 5001
[  4] local 10.0.4.58 port 47884 connected with 10.0.4.102 port 5001
[  5] local 10.0.4.58 port 47886 connected with 10.0.4.102 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  1.50 GBytes  1.28 Gbits/sec
[  3]  0.0-10.0 sec  3.92 GBytes  3.37 Gbits/sec
[  6]  0.0-10.0 sec  3.92 GBytes  3.36 Gbits/sec
[  4]  0.0-10.0 sec  1.22 GBytes  1.05 Gbits/sec
[  5]  0.0-10.0 sec  1.20 GBytes  1.03 Gbits/sec
[SUM]  0.0-10.0 sec  11.8 GBytes  10.1 Gbits/sec
```
Did you specify the number of processes in the mpiexec command?
I specified the number of processes in the hostfile. Of course, I checked the number of running processes with the `ps` and `top` commands.
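For reference, pinning the process count in the hostfile rather than on the command line looks roughly like this with Open MPI (hostnames, slot counts, and the training script name below are placeholders, not the actual experimental setup):

```shell
# hostfile: one line per machine; "slots" is how many processes to launch there.
cat > hostfile <<'EOF'
node-1 slots=8
node-2 slots=8
EOF

# With slots given in the hostfile, Open MPI launches one process per slot
# even without -n; passing -n explicitly would have to match the slot total.
mpiexec --hostfile hostfile python train.py
```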
Which MPI implementation did you use?
OpenMPI v3.0.0
What happens if you run 1 process/machine on x8 instances?
The result was the same as with p2.xlarge.
I will re-evaluate with ImageNet anyway.
Thanks! (After you finish the new experiment, please re-open this issue or create a new one.)
I've re-evaluated, and now the results look reasonable. https://qiita.com/sonots/items/22384bbc61284f2fdf94#%E3%81%BE%E3%81%A8%E3%82%81
Thanks!
I've evaluated the distribution efficiency of ChainerMN on AWS GPU instances.
The article is here: https://qiita.com/sonots/private/22384bbc61284f2fdf94 (in Japanese).
My evaluation showed that:
I first thought this result was to be expected, because ChainerMN recommends using InfiniBand while p2.16xlarge has only 20 Gbps of network bandwidth. However, ChainerMN used only 6.0 Gbps during my experiment, so the network was not the bottleneck.
I investigated further, and it seemed that sys CPU usage increased in the multi-node experiments.
So the reason may be that kernel syscalls, probably handling network traffic, steal CPU time from the Chainer process, so that it cannot make progress on its main work. But this is just my guess.
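One rough way to check this guess (a Linux-only sketch reading `/proc/stat` directly, not anything ChainerMN-specific) is to sample kernel vs. user CPU time while training is running:

```shell
# Sample the aggregate CPU counters twice and report what share of the
# interval the CPUs spent in the kernel ("sys") versus user space.
read -r _ u1 n1 s1 i1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 _ < /proc/stat
awk -v du=$((u2 - u1)) -v dn=$((n2 - n1)) -v ds=$((s2 - s1)) -v di=$((i2 - i1)) \
    'BEGIN { t = du + dn + ds + di
             printf "user %.1f%%  sys %.1f%%\n", 100 * (du + dn) / t, 100 * ds / t }'
```

If the sys share grows sharply as nodes are added while the user share shrinks, that would support the syscall-overhead explanation.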
Do you have any explanation for this, or any ideas for improving performance on AWS?