gfursin opened this issue 5 years ago
Here is the comparison with baseline:
[2019-02-21 13:19:35,972:INFO:root:__call__] Epoch[0] Batch [100] Speed: 114.34 samples/sec perplexity=827.708524
[2019-02-21 13:19:35,976:INFO:root:__call__] Epoch[0] Batch [100] Speed: 114.42 samples/sec perplexity=827.708524
[2019-02-21 13:19:35,976:INFO:root:__call__] Epoch[0] Batch [100] Speed: 114.41 samples/sec perplexity=827.708524
[2019-02-21 13:19:35,979:INFO:root:__call__] Epoch[0] Batch [100] Speed: 114.39 samples/sec perplexity=827.708524
[2019-02-21 13:20:02,349:INFO:root:__call__] Epoch[0] Batch [200] Speed: 121.33 samples/sec perplexity=618.833784
[2019-02-21 13:20:02,356:INFO:root:__call__] Epoch[0] Batch [200] Speed: 121.29 samples/sec perplexity=618.833784
[2019-02-21 13:20:02,362:INFO:root:__call__] Epoch[0] Batch [200] Speed: 121.28 samples/sec perplexity=618.833784
[2019-02-21 13:20:02,364:INFO:root:__call__] Epoch[0] Batch [200] Speed: 121.28 samples/sec perplexity=618.833784
[2019-02-21 13:20:28,995:INFO:root:__call__] Epoch[0] Batch [300] Speed: 120.15 samples/sec perplexity=531.093455
[2019-02-21 13:20:29,005:INFO:root:__call__] Epoch[0] Batch [300] Speed: 120.05 samples/sec perplexity=531.093455
[2019-02-21 13:20:29,004:INFO:root:__call__] Epoch[0] Batch [300] Speed: 120.09 samples/sec perplexity=531.093455
[2019-02-21 13:20:29,012:INFO:root:__call__] Epoch[0] Batch [300] Speed: 120.09 samples/sec perplexity=531.093455
[2019-02-21 13:20:55,952:INFO:root:__call__] Epoch[0] Batch [400] Speed: 118.71 samples/sec perplexity=471.731402
...
[2019-02-21 14:07:45,915:INFO:root:__call__] Epoch[2] Batch [9900] Speed: 108.44 samples/sec perplexity=30.107301
[2019-02-21 14:07:45,917:INFO:root:__call__] Epoch[2] Batch [9900] Speed: 108.41 samples/sec perplexity=30.107301
[2019-02-21 14:07:45,920:INFO:root:__call__] Epoch[2] Batch [9900] Speed: 108.39 samples/sec perplexity=30.107301
[2019-02-21 14:07:45,921:INFO:root:__call__] Epoch[2] Batch [9900] Speed: 108.42 samples/sec perplexity=30.107301
[2019-02-21 14:08:15,377:INFO:root:__call__] Epoch[2] Batch [10000] Speed: 108.64 samples/sec perplexity=29.837692
[2019-02-21 14:08:15,382:INFO:root:__call__] Epoch[2] Batch [10000] Speed: 108.62 samples/sec perplexity=29.837692
[2019-02-21 14:08:15,385:INFO:root:__call__] Epoch[2] Batch [10000] Speed: 108.59 samples/sec perplexity=29.837692
[2019-02-21 14:08:15,391:INFO:root:__call__] Epoch[2] Batch [10000] Speed: 108.56 samples/sec perplexity=29.837692
Log file: experiment-sockeye-baseline.log
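For anyone who wants to compare runs, the speedometer lines above have a regular shape that is easy to extract programmatically. Here is a small sketch (the regex is written against the exact log format shown above; field names like `parse_line` are just illustrative):

```python
import re

# Matches the MXNet/Sockeye speedometer lines shown in the log above, e.g.:
# [2019-02-21 13:19:35,972:INFO:root:__call__] Epoch[0] Batch [100] Speed: 114.34 samples/sec perplexity=827.708524
LOG_RE = re.compile(
    r"Epoch\[(?P<epoch>\d+)\] Batch \[(?P<batch>\d+)\] "
    r"Speed: (?P<speed>[\d.]+) samples/sec "
    r"perplexity=(?P<ppl>[\d.]+)"
)

def parse_line(line):
    """Return (epoch, batch, samples/sec, perplexity), or None if the line doesn't match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return (int(m.group("epoch")), int(m.group("batch")),
            float(m.group("speed")), float(m.group("ppl")))

line = ("[2019-02-21 13:19:35,972:INFO:root:__call__] Epoch[0] Batch [100] "
        "Speed: 114.34 samples/sec perplexity=827.708524")
print(parse_line(line))  # -> (0, 100, 114.34, 827.708524)
```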
Sockeye training is sped up by about 15% with P3. Since this is a relatively small model, the 25 Gbps network is large enough for parameter synchronization, which is why P3 does not show a significant performance benefit over the baseline here. In our tightly controlled experiment, we achieved a 38% improvement on Sockeye.
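A rough back-of-envelope check illustrates why the link is not the bottleneck for a small model. All numbers below are illustrative assumptions (model size, batch size, and the naive push+pull traffic pattern), not measured values from this run:

```python
# Illustrative bandwidth estimate for parameter-server synchronization.
# All inputs are assumptions, chosen only to show the order of magnitude.
params = 50e6                # assumed model size: 50M fp32 parameters
bytes_per_param = 4          # fp32
batches_per_sec = 120 / 32   # ~120 samples/sec at an assumed batch size of 32

# Naive push (gradients) + pull (updated parameters) of the full model each batch:
traffic_bits = params * bytes_per_param * 8 * 2 * batches_per_sec
print(f"~{traffic_bits / 1e9:.1f} Gbps")  # ~12.0 Gbps, well under the 25 Gbps link
```

Under these assumptions the synchronization traffic stays comfortably below 25 Gbps, consistent with the observation that the network is not the limiting factor for this model.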
Sure, I see! That makes sense! Thanks a lot again, Anand, for your help!
I am trying to test Sockeye via CK on 5 GRID5000 machines (1 master and 4 nodes), each with an Nvidia GTX 1080 Ti GPU and an Intel XXV710 Ethernet controller for 25GbE SFP28 (Lille nodes).
I booked machines via:
I then uploaded CUDA 9.2 with cuDNN 7.3.0 to my home directory and installed P3 via CK on one of the machines, as described in the CK P3 README.
Here are logs about platform and CK installation:
ck-platform.log
ck-platform-gpgpu.log
ck-env.log
I used one machine as the master, described the 4 worker machines in the hosts.json file, and added it to the CK machine:grid5000 entry:
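For context, the hosts.json I am describing looks roughly like the sketch below. The IP addresses are placeholders, and the exact schema expected by the CK machine:grid5000 entry may differ, so treat this only as an illustration of the master/worker split:

```json
{
  "master": "172.16.0.1",
  "workers": [
    "172.16.0.2",
    "172.16.0.3",
    "172.16.0.4",
    "172.16.0.5"
  ]
}
```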
I then ran Sockeye via the CK program pipeline:
I got the following long log (the run was aborted after the 1-hour timeout on the booked machines): experiment-sockeye.log
Some results: