ctuning / reproduce-sysml19-paper-p3

Reproducibility report and the Collective Knowledge workflow for the SysML'19 paper "Priority-based Parameter Propagation for Distributed DNN Training"
http://sysml.cc

This repository contains the reproducibility report for the SysML'19 paper "Priority-based Parameter Propagation for Distributed DNN Training". Feel free to continue evaluating all experimental results from this paper and report your feedback here.

Artifact check-list (meta-information)

Installation

We implemented a simple CK workflow (pipeline), with shared CK packages for the P3 tool, models and datasets, to automate and facilitate validation of the results.

CK framework

Install CK as described here.

CK workflow (pipeline) for this paper

$ ck pull repo:reproduce-sysml19-paper-p3

Note that CK will also pull all other related repositories. If you have already installed CK repositories, you can update all of them at any time as follows:

$ ck pull all

Installing packages

Install the P3 tool from this paper via CK, either from GitHub or from Zenodo:

$ ck install package:sysml19-p3-github
or
$ ck install package:sysml19-p3-zenodo

CK will attempt to automatically detect GCC, CUDA and cuDNN, and will install OpenCV and OpenBLAS in user space.

Install a small ImageNet1K training data set just to test the workflow (with batch size 1):

$ ck install package:imagenet-2012-train-min

Install a package that converts this data set to the P3 format:

$ ck install package:dataset-imagenet-2012-train-p3

Later, you can install the complete ImageNet1K training data set (this may take 1 day to download and may require 500GB of space):

$ ck install package:imagenet-2012-train
$ ck install package:dataset-imagenet-2012-train-p3
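
Since the complete data set may require 500GB, it can help to check the free space before starting the download. The sketch below is not part of the CK workflow: the helper name `has_enough_space`, the decimal interpretation of GB, and the directory passed in are assumptions, and you should point it at wherever your CK installation stores packages.

```python
import shutil

REQUIRED_GB = 500  # space figure quoted above for the complete data set


def has_enough_space(path=".", required_gb=REQUIRED_GB, usage=shutil.disk_usage):
    """Return True if the file system holding `path` has enough free space.

    `usage` is injectable so the check can be exercised without a real disk.
    Assumption: 1 GB is treated as 10**9 bytes.
    """
    free_bytes = usage(path).free
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    # "." is a placeholder; replace it with your package directory.
    print("enough space:", has_enough_space("."))
```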

Note that if you already have ImageNet1K downloaded and extracted somewhere, you can ask CK to detect it rather than downloading it again:

$ ck detect soft:dataset.imagenet.train --search_dirs={path to downloaded and extracted ImageNet1K}

Evaluation

We created a CK program workflow (pipeline) with meta-information that describes the dependencies (code, models and data sets), automates their installation during the first execution (P3, data sets, etc.) and assembles the different command lines.

The pre-processing CK script (preprocess.py) prepares the list of hosts on which to run experiments. The post-processing CK script (postprocess.py) parses the output and unifies the different metrics.
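
For illustration only, the kind of parsing and unification that a post-processing step performs can be sketched as follows. This is not the repository's postprocess.py: the log line format ("Speed: ... samples/sec"), the function name `unify_metrics`, and the output keys are all assumptions made for this example.

```python
import re
import statistics

# Assumed (hypothetical) log line shape: "... Speed: 123.45 samples/sec ..."
SPEED_RE = re.compile(r"Speed:\s*([0-9]+(?:\.[0-9]+)?)\s*samples/sec")


def unify_metrics(log_text):
    """Collect every throughput sample in the log into one summary dict."""
    speeds = [float(m.group(1)) for m in SPEED_RE.finditer(log_text)]
    if not speeds:
        # No matching lines: report that explicitly instead of failing.
        return {"num_samples": 0}
    return {
        "num_samples": len(speeds),
        "throughput_mean": statistics.mean(speeds),
        "throughput_min": min(speeds),
        "throughput_max": max(speeds),
    }
```

Returning a single dictionary with fixed keys makes results from different models and runs directly comparable, which is the point of unifying metrics.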

Cluster preparation

You need to register the list of hosts on which to run experiments.

Create a "hosts.json" file listing the host names or IPs (make sure that you can ssh to each of them without a password):

["chifflet-2", "chifflet-4"]
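
A minimal sketch for producing this file and checking passwordless ssh before registering the cluster. The function names are hypothetical and not part of CK; the check relies on the standard OpenSSH options BatchMode and ConnectTimeout, so that ssh fails rather than prompting for a password.

```python
import json
import subprocess
from pathlib import Path


def write_hosts_file(hosts, path="hosts.json"):
    """Write the host list as a JSON array, the layout shown above."""
    if not hosts or not all(isinstance(h, str) and h for h in hosts):
        raise ValueError("hosts must be a non-empty list of non-empty strings")
    Path(path).write_text(json.dumps(hosts) + "\n")
    return str(path)


def ssh_check_command(host):
    """Build an ssh command that fails instead of asking for a password."""
    # BatchMode=yes disables password prompts; ConnectTimeout limits hangs.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def unreachable_hosts(hosts, runner=subprocess.run):
    """Return the hosts whose ssh check exits with a non-zero status."""
    failed = []
    for host in hosts:
        result = runner(ssh_check_command(host), capture_output=True)
        if result.returncode != 0:
            failed.append(host)
    return failed
```

Calling `unreachable_hosts(["chifflet-2", "chifflet-4"])` on the machine that will launch the experiments should return an empty list if every host accepts key-based logins.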

Now register this configuration in CK under some name, such as "grid5000", as follows:

$ ck add machine:grid5000 --type=cluster --config_file=hosts.json

When asked about the remote node OS, select linux-64. You can view all registered configurations of target platforms as follows:

$ ck show machine

ImageNet experiments

You can now run ImageNet experiments as follows:

$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=resnet
$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=inception-v3
$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=vgg

You can change the default batch size (32) as follows:

$ ck run program:sysml19-p3 --target=grid5000 \
                            --cmd_key=resnet \
                            --env.BATCH_SIZE=32
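
To run all three models over several batch sizes, the commands above can be generated in a loop. This is a minimal sketch that uses only the flags already shown in this section; the helper names and the batch size 16 are illustrative assumptions, and the script prints the commands rather than launching them.

```python
import itertools
import subprocess

MODELS = ["resnet", "inception-v3", "vgg"]  # --cmd_key values used above


def build_run_command(target, cmd_key, batch_size=None):
    """Assemble one `ck run` command line as an argument list."""
    cmd = ["ck", "run", "program:sysml19-p3",
           f"--target={target}", f"--cmd_key={cmd_key}"]
    if batch_size is not None:
        cmd.append(f"--env.BATCH_SIZE={batch_size}")
    return cmd


def build_sweep(target, models=MODELS, batch_sizes=(32,)):
    """Return one command per (model, batch size) combination."""
    return [build_run_command(target, model, size)
            for model, size in itertools.product(models, batch_sizes)]


if __name__ == "__main__":
    for cmd in build_sweep("grid5000", batch_sizes=(16, 32)):
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment to launch each run
```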

IWSLT15 experiments

You can also run IWSLT15 experiments as follows:

$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=sockeye --env.OUTPUT_FILE=/tmp/sockeye_1.5-iwslt15_en-vi.sh

Validated results on GRID5000: link.

Suggestions

We expect the community to continue validating results from this and other SysML'19 papers (see our notes and example).

Reproducibility badges

We awarded the following badges based on the above evaluation:

ACM badges

cTuning foundation badges

automation workflow