This repository contains the reproducibility report for the SysML'19 paper "Priority-based Parameter Propagation for Distributed DNN Training". Feel free to continue evaluating all experimental results from this paper and report your feedback here.
We implemented a simple CK workflow (pipeline) with shared CK packages for this project, its models, and its data sets to automate and facilitate the validation of results.
Install CK as described here.
$ ck pull repo:reproduce-sysml19-paper-p3
Note that CK will also pull all other related repositories. If you have already installed CK repositories, you can update all of them at any time as follows:
$ ck pull all
Install the P3 tool from this paper via CK, from either GitHub or Zenodo:
$ ck install package:sysml19-p3-github
or
$ ck install package:sysml19-p3-zenodo
CK will automatically attempt to detect GCC, CUDA, and cuDNN, and to install OpenCV and OpenBLAS in user space.
Install a small ImageNet1K training data set just to test the workflow (with batch size 1):
$ ck install package:imagenet-2012-train-min
Install a package which will convert this dataset to P3 format:
$ ck install package:dataset-imagenet-2012-train-p3
Later you can install the complete ImageNet1K training data set (the download may take 1 day and may require 500GB of space):
$ ck install package:imagenet-2012-train
$ ck install package:dataset-imagenet-2012-train-p3
Note that if you already have ImageNet1K downloaded and extracted somewhere, you can ask CK to detect it rather than downloading it again:
$ ck detect soft:dataset.imagenet.train --search_dirs={path to downloaded and extracted ImageNet1K}
We created a CK program workflow (pipeline) with meta information that describes dependencies (code, models and data sets), automates their installation during the first execution (P3, data sets, etc.), and assembles the different command lines.
The pre-processing CK script prepares the list of hosts to run experiments: preprocess.py. The post-processing CK script parses the output and unifies the different metrics: postprocess.py.
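To illustrate what "parsing output and unifying metrics" can look like, here is a minimal sketch. It is not the repository's postprocess.py. It assumes the training log contains lines of the form `Speed: <number> samples/sec`, which this README does not confirm; the function name `summarize_throughput`, the dictionary keys, and the sample log lines are all hypothetical.

```python
import re
import statistics

# Assumed log pattern (not confirmed by this README): "Speed: 123.45 samples/sec"
SPEED_RE = re.compile(r"Speed:\s*([0-9]+(?:\.[0-9]+)?)\s*samples/sec")


def summarize_throughput(log_text):
    """Collect all throughput readings from a log and reduce them to one record."""
    speeds = [float(m.group(1)) for m in SPEED_RE.finditer(log_text)]
    if not speeds:
        # No readings found: report that explicitly instead of failing.
        return {"num_samples": 0}
    return {
        "num_samples": len(speeds),
        "mean_samples_per_sec": statistics.mean(speeds),
        "min_samples_per_sec": min(speeds),
        "max_samples_per_sec": max(speeds),
    }


if __name__ == "__main__":
    # Hypothetical log excerpt, made up for illustration only.
    sample = (
        "Epoch[0] Batch [20]\tSpeed: 100.00 samples/sec\n"
        "Epoch[0] Batch [40]\tSpeed: 120.00 samples/sec\n"
    )
    print(summarize_throughput(sample))
```

Reducing each run to one flat dictionary makes results from different models and clusters directly comparable, which is the general idea behind a unifying post-processing step.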
You need to register the list of hosts on which to run experiments.
First, create a "hosts.json" file with a list of host names or IPs (make sure that you can ssh to each of them without a password):
["chifflet-2", "chifflet-4"]
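If the host list is long or generated by a scheduler, the file can also be produced by a small script. The sketch below is one possible approach, not part of the CK workflow: `write_hosts_file` and `ssh_check_command` are hypothetical helper names. The check command relies on the standard OpenSSH option `BatchMode=yes`, which makes ssh fail instead of prompting for a password.

```python
import json
import tempfile
from pathlib import Path


def write_hosts_file(hosts, path):
    """Validate a list of host names and write it as a JSON array."""
    if not hosts:
        raise ValueError("at least one host is required")
    for host in hosts:
        if not isinstance(host, str) or not host.strip():
            raise ValueError(f"invalid host entry: {host!r}")
    path = Path(path)
    path.write_text(json.dumps(hosts) + "\n")
    return path


def ssh_check_command(host):
    """Build an ssh command that fails rather than asking for a password."""
    return ["ssh", "-o", "BatchMode=yes", host, "true"]


if __name__ == "__main__":
    hosts = ["chifflet-2", "chifflet-4"]
    # Written to the temp directory here; use your working directory in practice.
    target = write_hosts_file(hosts, Path(tempfile.gettempdir()) / "hosts.json")
    print(target.read_text().strip())
    for host in hosts:
        # Pass each list to subprocess.run(..., check=True) to test the login.
        print(" ".join(ssh_check_command(host)))
```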
Now register this configuration in CK under a name of your choice, such as "grid5000":
$ ck add machine:grid5000 --type=cluster --config_file=hosts.json
When asked about the remote node OS, select linux-64. You can view all registered target platform configurations as follows:
$ ck show machine
You can now run ImageNet experiments as follows:
$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=resnet
$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=inception-v3
$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=vgg
You can change the default batch size (32) via the BATCH_SIZE environment variable as follows:
$ ck run program:sysml19-p3 --target=grid5000 \
--cmd_key=resnet \
--env.BATCH_SIZE=32
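To run several models with the same settings, the command lines above can be assembled programmatically. This is a minimal sketch using only the flags shown in this README (`--target`, `--cmd_key`, `--env.NAME=VALUE`); the helper `build_run_command` is hypothetical, and the batch size of 16 is an arbitrary illustrative value, not a recommended setting.

```python
import shlex


def build_run_command(cmd_key, target="grid5000", env=None):
    """Return the argument list for one `ck run` invocation."""
    argv = [
        "ck", "run", "program:sysml19-p3",
        f"--target={target}",
        f"--cmd_key={cmd_key}",
    ]
    # Each environment override becomes one --env.NAME=VALUE flag.
    for name, value in sorted((env or {}).items()):
        argv.append(f"--env.{name}={value}")
    return argv


if __name__ == "__main__":
    # Print the commands instead of running them, so this works as a dry run.
    for key in ("resnet", "inception-v3", "vgg"):
        print(shlex.join(build_run_command(key, env={"BATCH_SIZE": 16})))
```

To actually launch the experiments, each argument list can be passed to `subprocess.run(argv, check=True)` on a machine where CK and the registered target are available.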
You can also run IWSLT15 experiments as follows:
$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=sockeye --env.OUTPUT_FILE=/tmp/sockeye_1.5-iwslt15_en-vi.sh
Validated results on GRID5000: link.
We expect the community to continue validating results from this and other SysML'19 papers (see our notes and example).
We awarded the following badges based on the above evaluation: