gfursin opened this issue 5 years ago
According to these measurements, P3 improves data-parallel training throughput by about 1.5x over the baseline. This result closely matches the numbers in the original paper. VGG-19 is a large model with a total gradient size of around 500 MB, synchronized through the parameter server on each iteration. Based on our experience, a 25 Gbps network is not sufficient to scale VGG-19 linearly on 4 machines.
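As a rough sanity check on why 25 Gbps becomes a bottleneck, here is a back-of-the-envelope sketch (my own estimate, not from the paper) of the time needed just to move ~500 MB of gradients across the link once per iteration. A real parameter-server setup pushes and pulls gradients across multiple workers, so actual traffic is higher; this is only a lower bound.

```python
# Hypothetical back-of-the-envelope estimate: time to move the full
# gradient over the network once per iteration. Real parameter-server
# traffic (push + pull, several workers) would be a multiple of this.

def sync_time_seconds(gradient_mb: float, bandwidth_gbps: float) -> float:
    """Time to transfer gradient_mb megabytes over a bandwidth_gbps link."""
    bits = gradient_mb * 8e6               # MB -> bits (1 MB = 8e6 bits here)
    return bits / (bandwidth_gbps * 1e9)   # Gbps -> bits per second

# VGG-19: ~500 MB of gradients over a 25 Gbps link
print(round(sync_time_seconds(500, 25), 3))
```

Even this optimistic lower bound is ~0.16 s of pure communication per iteration, which is easy to compare against per-iteration compute time to see when the network, rather than the GPU, dominates.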
This result shows that P3 can provide a better speedup over the baseline under limited-bandwidth conditions. In our experiments, using 4 P4000 GPUs and varying the network bandwidth from 5 to 30 Gbps, we measured a peak speedup of 66% for VGG-19 over the baseline.
That makes sense - thanks, Anand!
The setup is similar to #1: 1 master machine and 4 nodes, each with an Nvidia GTX 1080 Ti GPU and an Intel Ethernet Controller XXV710 for 25GbE SFP28 (Lille nodes).
Results from P3:
Complete log: experiment-vgg-github.log
Baseline results:
Complete log: experiment-vgg-baseline.log