ctuning / reproduce-sysml19-paper-p3

Reproducibility report and the Collective Knowledge workflow for the SysML'19 paper "Priority-based Parameter Propagation for Distributed DNN Training"
http://sysml.cc

checking resnet on GRID5000 #2

Open gfursin opened 5 years ago

gfursin commented 5 years ago

Setup is similar to #1: 1 master machine and 4 nodes, each with an Nvidia GTX 1080 Ti GPU and an Intel Ethernet Controller XXV710 for 25GbE SFP28 (Lille nodes).

Results from P3:

INFO:root:Epoch[0] Batch [20]   Speed: 162.39 samples/sec       accuracy=0.000000
INFO:root:Epoch[0] Batch [20]   Speed: 162.08 samples/sec       accuracy=0.001488
INFO:root:Epoch[0] Batch [20]   Speed: 162.08 samples/sec       accuracy=0.000000
INFO:root:Epoch[0] Batch [20]   Speed: 162.04 samples/sec       accuracy=0.000000
INFO:root:Epoch[0] Batch [40]   Speed: 161.64 samples/sec       accuracy=0.001563
INFO:root:Epoch[0] Batch [40]   Speed: 161.78 samples/sec       accuracy=0.001563
INFO:root:Epoch[0] Batch [40]   Speed: 161.64 samples/sec       accuracy=0.000000
INFO:root:Epoch[0] Batch [40]   Speed: 161.59 samples/sec       accuracy=0.001563

...

INFO:root:Epoch[0] Batch [9500] Speed: 165.09 samples/sec       accuracy=0.207813
INFO:root:Epoch[0] Batch [9500] Speed: 164.87 samples/sec       accuracy=0.212500
INFO:root:Epoch[0] Batch [9500] Speed: 164.48 samples/sec       accuracy=0.209375
INFO:root:Epoch[0] Batch [9500] Speed: 164.49 samples/sec       accuracy=0.181250
INFO:root:Epoch[0] Train-accuracy=0.265625
INFO:root:Epoch[0] Time cost=1871.860
...
INFO:root:Total arg_params size = 25549486
INFO:root:Epoch[0] Validation-accuracy=0.231937

Complete log: experiment-resnet-github.log

Baseline results:

INFO:root:Epoch[0] Batch [20]   Speed: 142.85 samples/sec       accuracy=0.002976
INFO:root:Epoch[0] Batch [20]   Speed: 141.86 samples/sec       accuracy=0.001488
INFO:root:Epoch[0] Batch [20]   Speed: 141.59 samples/sec       accuracy=0.002976
INFO:root:Epoch[0] Batch [20]   Speed: 141.60 samples/sec       accuracy=0.000000
INFO:root:Epoch[0] Batch [40]   Speed: 146.49 samples/sec       accuracy=0.000000
INFO:root:Epoch[0] Batch [40]   Speed: 145.71 samples/sec       accuracy=0.000000
INFO:root:Epoch[0] Batch [40]   Speed: 146.13 samples/sec       accuracy=0.003125
INFO:root:Epoch[0] Batch [40]   Speed: 146.16 samples/sec       accuracy=0.001563

...

INFO:root:Epoch[0] Batch [9500] Speed: 142.21 samples/sec       accuracy=0.203125
INFO:root:Epoch[0] Batch [9500] Speed: 142.16 samples/sec       accuracy=0.240625
INFO:root:Epoch[0] Batch [9500] Speed: 142.10 samples/sec       accuracy=0.207813
INFO:root:Epoch[0] Train-accuracy=0.218750
INFO:root:Epoch[0] Time cost=2142.595
...
INFO:root:Total arg_params size = 25549486
INFO:root:Epoch[0] Validation-accuracy=0.237563

Complete log: experiment-resnet-baseline.log
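To compare the two runs beyond eyeballing individual lines, the per-batch throughput can be averaged over all `Speed:` entries in each log. A minimal sketch (the inline `sample` text is a stand-in for the full logs, which live in the attached `experiment-resnet-github.log` / `experiment-resnet-baseline.log` files):

```python
import re
import statistics

# Matches MXNet speedometer lines like:
# INFO:root:Epoch[0] Batch [40]   Speed: 161.64 samples/sec   accuracy=0.001563
SPEED_RE = re.compile(r"Speed:\s+([\d.]+)\s+samples/sec")

def mean_throughput(log_text):
    """Return the mean of all reported samples/sec values in a log."""
    speeds = [float(m.group(1)) for m in SPEED_RE.finditer(log_text)]
    return statistics.mean(speeds) if speeds else 0.0

sample = """\
INFO:root:Epoch[0] Batch [9500] Speed: 165.09 samples/sec       accuracy=0.207813
INFO:root:Epoch[0] Batch [9500] Speed: 164.87 samples/sec       accuracy=0.212500
"""
print(f"mean: {mean_throughput(sample):.2f} samples/sec")  # → mean: 164.98 samples/sec
```

Running this over the complete P3 and baseline logs gives the per-worker averages that the speedup discussion below is based on.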

anandj91 commented 5 years ago

As per these measurements, P3 sped up the data-parallel training throughput of ResNet-50 by about 15%. This particular experiment uses a 25 Gbps network. In our experience, 25 Gbps is sufficient for linearly scaling ResNet-50, since it is a relatively small model. This is why P3 is not able to provide a significant speedup over the baseline here. In the controlled experiments we conducted, we saw a peak speedup of 25% for ResNet-50.
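The ~15% figure can be sanity-checked from the Epoch[0] `Time cost` values reported in the logs above (values copied verbatim from this thread):

```python
# Epoch[0] time costs, in seconds, taken from the logs in this issue.
p3_time = 1871.860
baseline_time = 2142.595

# Speedup of P3 relative to the baseline (lower epoch time = faster).
speedup = baseline_time / p3_time - 1.0
print(f"Epoch-0 speedup: {speedup:.1%}")  # → Epoch-0 speedup: 14.5%
```

This agrees with the per-batch throughputs (~164-165 vs. ~142-146 samples/sec per worker), which also put the gain in the 14-16% range.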

gfursin commented 5 years ago

That sounds good! Thanks!