ctuning / reproduce-sysml19-paper-p3

Reproducibility report and the Collective Knowledge workflow for the SysML'19 paper "Priority-based Parameter Propagation for Distributed DNN Training"
http://sysml.cc

checking sockeye on GRID5000 #1

Open gfursin opened 5 years ago

gfursin commented 5 years ago

I am trying to test sockeye via CK on 5 GRID5000 machines (1 master and 4 nodes) in Lille, each with an Nvidia GTX 1080 Ti GPU and an Intel Ethernet Controller XXV710 for 25GbE SFP28.

I booked machines via:

$ oarsub -I -t allow_classic_ssh -p "cluster='chifflet'" -l nodes=1

I then uploaded CUDA 9.2 with cuDNN 7.3.0 to my home directory and installed P3 via CK on one of the machines, as described in the CK P3 README.

Here are logs about platform and CK installation:

$ ck detect platform

ck-platform.log

$ ck detect platform.gpgpu --cuda

ck-platform-gpgpu.log

$ ck show env

ck-env.log

I used one machine as the master, described the 4 worker machines in a hosts.json file, and registered it in CK as machine:grid5000:

$ ck add machine:grid5000 --type=cluster --config_file=hosts.json

I then ran sockeye via CK program pipeline:

$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=sockeye --env.OUTPUT_FILE=/tmp/sockeye_1.5-iwslt15_en-vi.sh

I got the following long log (the run was aborted when the 1-hour reservation on the booked machines timed out): experiment-sockeye.log

Some results:

[2019-02-20 20:17:09,204:INFO:root:__call__] Epoch[0] Batch [100]       Speed: 130.08 samples/sec       perplexity=855.555172
[2019-02-20 20:17:09,205:INFO:root:__call__] Epoch[0] Batch [100]       Speed: 130.08 samples/sec       perplexity=855.555172
[2019-02-20 20:17:09,205:INFO:root:__call__] Epoch[0] Batch [100]       Speed: 130.08 samples/sec       perplexity=855.555172
[2019-02-20 20:17:09,229:INFO:root:__call__] Epoch[0] Batch [100]       Speed: 129.96 samples/sec       perplexity=855.555172
[2019-02-20 20:17:32,766:INFO:root:__call__] Epoch[0] Batch [200]       Speed: 135.82 samples/sec       perplexity=629.661028
[2019-02-20 20:17:32,768:INFO:root:__call__] Epoch[0] Batch [200]       Speed: 135.81 samples/sec       perplexity=629.661028
[2019-02-20 20:17:32,770:INFO:root:__call__] Epoch[0] Batch [200]       Speed: 135.79 samples/sec       perplexity=629.661028
[2019-02-20 20:17:32,786:INFO:root:__call__] Epoch[0] Batch [200]       Speed: 135.84 samples/sec       perplexity=629.661028
[2019-02-20 20:17:56,482:INFO:root:__call__] Epoch[0] Batch [300]       Speed: 134.95 samples/sec       perplexity=544.224197
[2019-02-20 20:17:56,485:INFO:root:__call__] Epoch[0] Batch [300]       Speed: 134.93 samples/sec       perplexity=544.224197
[2019-02-20 20:17:56,484:INFO:root:__call__] Epoch[0] Batch [300]       Speed: 134.92 samples/sec       perplexity=544.224197
[2019-02-20 20:17:56,497:INFO:root:__call__] Epoch[0] Batch [300]       Speed: 134.96 samples/sec       perplexity=544.224197
[2019-02-20 20:18:20,296:INFO:root:__call__] Epoch[0] Batch [400]       Speed: 134.39 samples/sec       perplexity=484.884992

...

[2019-02-20 20:59:56,749:INFO:root:__call__] Epoch[2] Batch [9900]      Speed: 122.43 samples/sec       perplexity=29.895473
[2019-02-20 20:59:56,748:INFO:root:__call__] Epoch[2] Batch [9900]      Speed: 122.43 samples/sec       perplexity=29.895473
[2019-02-20 20:59:56,750:INFO:root:__call__] Epoch[2] Batch [9900]      Speed: 122.43 samples/sec       perplexity=29.895473
[2019-02-20 20:59:56,765:INFO:root:__call__] Epoch[2] Batch [9900]      Speed: 122.43 samples/sec       perplexity=29.895473
[2019-02-20 21:00:22,995:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 121.92 samples/sec       perplexity=29.633392
[2019-02-20 21:00:22,998:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 121.91 samples/sec       perplexity=29.633392
[2019-02-20 21:00:22,999:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 121.91 samples/sec       perplexity=29.633392
[2019-02-20 21:00:23,013:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 121.92 samples/sec       perplexity=29.633392
gfursin commented 5 years ago

Here is the comparison with baseline:

[2019-02-21 13:19:35,972:INFO:root:__call__] Epoch[0] Batch [100]       Speed: 114.34 samples/sec       perplexity=827.708524
[2019-02-21 13:19:35,976:INFO:root:__call__] Epoch[0] Batch [100]       Speed: 114.42 samples/sec       perplexity=827.708524
[2019-02-21 13:19:35,976:INFO:root:__call__] Epoch[0] Batch [100]       Speed: 114.41 samples/sec       perplexity=827.708524
[2019-02-21 13:19:35,979:INFO:root:__call__] Epoch[0] Batch [100]       Speed: 114.39 samples/sec       perplexity=827.708524
[2019-02-21 13:20:02,349:INFO:root:__call__] Epoch[0] Batch [200]       Speed: 121.33 samples/sec       perplexity=618.833784
[2019-02-21 13:20:02,356:INFO:root:__call__] Epoch[0] Batch [200]       Speed: 121.29 samples/sec       perplexity=618.833784
[2019-02-21 13:20:02,362:INFO:root:__call__] Epoch[0] Batch [200]       Speed: 121.28 samples/sec       perplexity=618.833784
[2019-02-21 13:20:02,364:INFO:root:__call__] Epoch[0] Batch [200]       Speed: 121.28 samples/sec       perplexity=618.833784
[2019-02-21 13:20:28,995:INFO:root:__call__] Epoch[0] Batch [300]       Speed: 120.15 samples/sec       perplexity=531.093455
[2019-02-21 13:20:29,005:INFO:root:__call__] Epoch[0] Batch [300]       Speed: 120.05 samples/sec       perplexity=531.093455
[2019-02-21 13:20:29,004:INFO:root:__call__] Epoch[0] Batch [300]       Speed: 120.09 samples/sec       perplexity=531.093455
[2019-02-21 13:20:29,012:INFO:root:__call__] Epoch[0] Batch [300]       Speed: 120.09 samples/sec       perplexity=531.093455
[2019-02-21 13:20:55,952:INFO:root:__call__] Epoch[0] Batch [400]       Speed: 118.71 samples/sec       perplexity=471.731402

...

[2019-02-21 14:07:45,915:INFO:root:__call__] Epoch[2] Batch [9900]      Speed: 108.44 samples/sec       perplexity=30.107301
[2019-02-21 14:07:45,917:INFO:root:__call__] Epoch[2] Batch [9900]      Speed: 108.41 samples/sec       perplexity=30.107301
[2019-02-21 14:07:45,920:INFO:root:__call__] Epoch[2] Batch [9900]      Speed: 108.39 samples/sec       perplexity=30.107301
[2019-02-21 14:07:45,921:INFO:root:__call__] Epoch[2] Batch [9900]      Speed: 108.42 samples/sec       perplexity=30.107301
[2019-02-21 14:08:15,377:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 108.64 samples/sec       perplexity=29.837692
[2019-02-21 14:08:15,382:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 108.62 samples/sec       perplexity=29.837692
[2019-02-21 14:08:15,385:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 108.59 samples/sec       perplexity=29.837692
[2019-02-21 14:08:15,391:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 108.56 samples/sec       perplexity=29.837692

Log file: experiment-sockeye-baseline.log
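
To put a number on the comparison, the "Speed" fields of the two logs can be averaged per batch and divided. The following is only a minimal sketch: the function names (mean_speed_per_batch, speedup_percent) are hypothetical helpers, not part of CK or sockeye, and the two sample strings simply repeat the Batch [10000] lines quoted above. It assumes the log line format shown in this thread.

```python
import re
from statistics import mean

# Assumed line format (taken from the excerpts in this thread):
#   [...] Epoch[2] Batch [10000]     Speed: 121.92 samples/sec   perplexity=...
LINE_RE = re.compile(r"Epoch\[(\d+)\] Batch \[(\d+)\]\s+Speed: ([\d.]+) samples/sec")


def mean_speed_per_batch(log_text):
    """Map (epoch, batch) to the mean speed over all workers reporting it."""
    speeds = {}
    for match in LINE_RE.finditer(log_text):
        key = (int(match.group(1)), int(match.group(2)))
        speeds.setdefault(key, []).append(float(match.group(3)))
    return {key: mean(values) for key, values in speeds.items()}


def speedup_percent(p3_log, baseline_log):
    """Percentage speedup of P3 over baseline for batches present in both logs."""
    p3 = mean_speed_per_batch(p3_log)
    base = mean_speed_per_batch(baseline_log)
    common = sorted(p3.keys() & base.keys())
    return {key: 100.0 * (p3[key] / base[key] - 1.0) for key in common}


# Sample input: the Batch [10000] lines from the two runs quoted above.
p3_log = """
[2019-02-20 21:00:22,995:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 121.92 samples/sec       perplexity=29.633392
[2019-02-20 21:00:22,998:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 121.91 samples/sec       perplexity=29.633392
[2019-02-20 21:00:22,999:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 121.91 samples/sec       perplexity=29.633392
[2019-02-20 21:00:23,013:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 121.92 samples/sec       perplexity=29.633392
"""

baseline_log = """
[2019-02-21 14:08:15,377:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 108.64 samples/sec       perplexity=29.837692
[2019-02-21 14:08:15,382:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 108.62 samples/sec       perplexity=29.837692
[2019-02-21 14:08:15,385:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 108.59 samples/sec       perplexity=29.837692
[2019-02-21 14:08:15,391:INFO:root:__call__] Epoch[2] Batch [10000]     Speed: 108.56 samples/sec       perplexity=29.837692
"""

result = speedup_percent(p3_log, baseline_log)
for (epoch, batch), pct in result.items():
    print(f"Epoch {epoch} batch {batch}: P3 is {pct:.1f}% faster than baseline")
```

For the Batch [10000] lines this gives roughly 12%. Passing the contents of the full experiment-sockeye.log and experiment-sockeye-baseline.log files instead of the sample strings would give the per-batch figure over the whole run.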

anandj91 commented 5 years ago

Sockeye training is sped up by about 15% with P3. Since this is a relatively small model, a 25 Gbps network already provides enough bandwidth for parameter synchronization, which is why P3 does not show a larger performance benefit over the baseline here. In our tightly controlled experiment, we managed to get a 38% improvement on Sockeye.

gfursin commented 5 years ago

Sure, I see! That makes sense! Thanks a lot again, Anand, for your help!