ctuning / reproduce-sysml19-paper-p3

Reproducibility report and the Collective Knowledge workflow for the SysML'19 paper "Priority-based Parameter Propagation for Distributed DNN Training"
http://sysml.cc

checking vgg on GRID5000 #3

Open gfursin opened 5 years ago

gfursin commented 5 years ago

The setup is similar to #1: 1 master machine and 4 nodes, each with an Nvidia GTX 1080 Ti GPU and an Intel Ethernet Controller XXV710 for 25GbE SFP28 (Lille nodes).

Results from P3:

INFO:root:Epoch[0] Batch [20]   Speed: 33.01 samples/sec        accuracy=0.001488
INFO:root:Epoch[0] Batch [20]   Speed: 33.02 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [20]   Speed: 33.02 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [20]   Speed: 31.98 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [40]   Speed: 33.67 samples/sec        accuracy=0.001563
INFO:root:Epoch[0] Batch [40]   Speed: 33.66 samples/sec        accuracy=0.004687
INFO:root:Epoch[0] Batch [40]   Speed: 33.66 samples/sec        accuracy=0.004687
INFO:root:Epoch[0] Batch [40]   Speed: 33.67 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [60]   Speed: 32.59 samples/sec        accuracy=0.001563
INFO:root:Epoch[0] Batch [60]   Speed: 32.57 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [60]   Speed: 32.59 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [60]   Speed: 32.57 samples/sec        accuracy=0.001563

...

INFO:root:Epoch[0] Batch [1000] Speed: 33.25 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [1000] Speed: 33.31 samples/sec        accuracy=0.001563
INFO:root:Epoch[0] Batch [1000] Speed: 33.24 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [1000] Speed: 33.24 samples/sec        accuracy=0.000000

Complete log: experiment-vgg-github.log

Baseline results:

INFO:root:Epoch[0] Batch [20]   Speed: 22.80 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [20]   Speed: 22.79 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [20]   Speed: 22.74 samples/sec        accuracy=0.001488
INFO:root:Epoch[0] Batch [20]   Speed: 22.60 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [40]   Speed: 22.89 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [40]   Speed: 23.01 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [40]   Speed: 22.81 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [40]   Speed: 22.80 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [60]   Speed: 23.01 samples/sec        accuracy=0.001563
INFO:root:Epoch[0] Batch [60]   Speed: 22.99 samples/sec        accuracy=0.003125
INFO:root:Epoch[0] Batch [60]   Speed: 22.81 samples/sec        accuracy=0.001563
INFO:root:Epoch[0] Batch [60]   Speed: 23.05 samples/sec        accuracy=0.004687
...
INFO:root:Epoch[0] Batch [1000] Speed: 21.48 samples/sec        accuracy=0.006250
INFO:root:Epoch[0] Batch [1000] Speed: 21.35 samples/sec        accuracy=0.001563
INFO:root:Epoch[0] Batch [1000] Speed: 21.51 samples/sec        accuracy=0.001563
INFO:root:Epoch[0] Batch [1000] Speed: 21.31 samples/sec        accuracy=0.001563

Complete log: experiment-vgg-baseline.log
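Both logs above use the same MXNet progress-line format (`Speed: … samples/sec`), so per-worker throughput can be extracted and averaged with a short script. A minimal sketch, not part of the CK workflow itself; the regex and the simple averaging are assumptions:

```python
import re

# Matches MXNet fit() progress lines such as:
# INFO:root:Epoch[0] Batch [1000] Speed: 33.25 samples/sec  accuracy=0.000000
LINE = re.compile(r"Speed:\s+([\d.]+)\s+samples/sec")

def mean_speed(log_text):
    """Average per-worker throughput (samples/sec) over all matching lines."""
    speeds = [float(m.group(1)) for m in LINE.finditer(log_text)]
    return sum(speeds) / len(speeds) if speeds else 0.0

sample = """\
INFO:root:Epoch[0] Batch [1000] Speed: 33.25 samples/sec        accuracy=0.000000
INFO:root:Epoch[0] Batch [1000] Speed: 33.31 samples/sec        accuracy=0.001563
"""
print(round(mean_speed(sample), 2))  # 33.28
```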

anandj91 commented 5 years ago

According to these measurements, P3 improves data-parallel training throughput by about 1.5x over the baseline. This result closely matches the numbers in the original paper. VGG-19 is a large model with roughly 500 MB of gradients synchronized through the parameter server on each iteration. In our experience, a 25 Gbps network is not sufficient to scale VGG-19 linearly across 4 machines.
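The 1.5x figure can be checked against the last logged batches, and the bandwidth argument follows from a quick back-of-envelope calculation. A sketch using only the numbers quoted above (the assumption that the full link is available to gradient traffic is an idealization):

```python
# Per-worker throughput at Batch [1000] from the logs above (samples/sec).
p3_speed = 33.25
baseline_speed = 21.48
print(f"speedup: {p3_speed / baseline_speed:.2f}x")  # speedup: 1.55x

# Minimum time to move ~500 MB of VGG-19 gradients over a 25 Gbps link,
# ignoring protocol overhead and contention between workers.
grad_bytes = 500e6
link_bps = 25e9
print(f"transfer: {grad_bytes * 8 / link_bps * 1000:.0f} ms per iteration")  # 160 ms
```

Even under this idealization, communication alone costs on the order of 160 ms per iteration, which is why overlapping it with computation (as P3 does) pays off at this model size.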

This result shows that P3 can deliver a larger speedup over the baseline under limited-bandwidth conditions. In our experiments with 4 P4000 GPUs, varying the network bandwidth from 5 to 30 Gbps, we observed a peak speedup of 66% over the baseline for VGG-19.

gfursin commented 5 years ago

That makes sense. Thanks, Anand!