JF-D / Proteus

Proteus

This is the official implementation of "Proteus: Simulating the Performance of Distributed DNN Training" (arXiv:2306.02267).

Proteus is the first standalone simulator to model the performance of complex parallelization strategies by simulating their execution. Proteus first models a parallelization strategy with a unified representation named Strategy Tree. It then compiles the strategy tree into a distributed execution graph and simulates complex runtime behaviors, in particular computation-communication overlap and bandwidth sharing, with a Hierarchical Topo-Aware Executor (HTAE). Proteus is evaluated across a wide variety of DNNs on three hardware configurations. Experimental results show that Proteus achieves 3.0% average prediction error and preserves the order of training throughput across parallelization strategies. Compared to state-of-the-art approaches, Proteus reduces prediction error by up to 133.8%.
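
The two accuracy claims above can be made concrete with a small sketch. It assumes that prediction error is the relative difference between simulated and measured throughput, and that "preserves order" means every pair of strategies is ranked the same way by the simulator as by real measurements. Both definitions are assumptions for illustration, not taken from the Proteus code, and the throughput numbers are made up.

```python
from itertools import combinations


def prediction_error(predicted: float, measured: float) -> float:
    """Relative error of one prediction, as a percentage of the measured value."""
    return abs(predicted - measured) / measured * 100.0


def average_prediction_error(predicted, measured) -> float:
    """Mean of the per-strategy relative errors (in percent)."""
    errors = [prediction_error(p, m) for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


def preserves_order(predicted, measured) -> bool:
    """True if no pair of strategies is ranked in opposite order by the two lists."""
    for i, j in combinations(range(len(measured)), 2):
        if (predicted[i] - predicted[j]) * (measured[i] - measured[j]) < 0:
            return False
    return True


# Made-up throughputs (samples/s) for three hypothetical strategies.
measured = [100.0, 200.0, 400.0]
predicted = [102.0, 194.0, 412.0]

avg_err = average_prediction_error(predicted, measured)
order_ok = preserves_order(predicted, measured)
print(f"average prediction error: {avg_err:.2f}%, order preserved: {order_ok}")
```

Order preservation matters because a simulator is typically used to choose between strategies: a small error is of little use if it flips the ranking of two candidates.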

Installation

First, compile the bundled NCCL in external/nccl by running the following commands:

cd external/nccl
make -j src.build

Then, install the Python dependencies and add the Proteus directory to PYTHONPATH:

pip install graphviz toposort
export PYTHONPATH=$PYTHONPATH:/path/to/proteus
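
After these steps, a quick check that Python can locate the dependencies may save some debugging. The sketch below uses only the standard library. Note that `proteus` as the importable package name is an assumption based on the PYTHONPATH instruction above; adjust it if the package is named differently.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# "graphviz" and "toposort" come from the pip command above;
# "proteus" is an assumed package name (hypothetical, not confirmed here).
status = check_modules(["graphviz", "toposort", "proteus"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

A module reported as MISSING usually means either the pip install ran in a different environment or PYTHONPATH was exported in a different shell session.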

Usage

Cluster Configuration

The cluster configuration is defined by two files: a device topo file and a cluster JSON file. The device topo file specifies the topology of a single node, and the cluster JSON file specifies the cluster information. Example topo files and cluster JSON files are provided in examples/clusters/. The device topo file is generated by running nccl-tests with the NCCL_TOPO_DUMP_FILE environment variable set to the desired output path.
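
Before pointing Proteus at a new pair of files, it can help to look at what they contain. The sketch below is a generic inspection helper, not part of Proteus: it counts element tags in an XML topology dump (NCCL writes its topology dump as XML) and lists the top-level keys of a JSON file. The cluster JSON schema is not described here, so the code assumes nothing about it. The inline sample strings are made-up placeholders, not real Proteus or NCCL files.

```python
import json
import xml.etree.ElementTree as ET
from collections import Counter


def count_xml_tags(xml_text: str) -> Counter:
    """Count how often each element tag occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag for elem in root.iter())


def top_level_keys(json_text: str) -> list:
    """Return the sorted top-level keys of a JSON object (empty list otherwise)."""
    data = json.loads(json_text)
    return sorted(data) if isinstance(data, dict) else []


# Made-up placeholders; replace with the contents of real files, e.g.
# open("path/to/topo.xml").read() and open("path/to/cluster.json").read().
sample_topo = "<system><cpu><pci><gpu/><gpu/></pci></cpu></system>"
sample_cluster = '{"placeholder_key": 1}'

tag_counts = count_xml_tags(sample_topo)
keys = top_level_keys(sample_cluster)
print(dict(tag_counts))
print(keys)
```

Comparing the output for a new file against one of the provided examples is a cheap way to spot a truncated dump or a missing field.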

Run Examples

Example scripts are provided in examples/. To try Proteus, run:

cd examples
mkdir -p log
python alexnet.py -model alexnet -bs 256 -cluster clusters/dgx1_v100_2ib/n1_g1.json -ps dp --profile-iters 50
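
To compare several configurations, the command above can be generated programmatically. The sketch below only builds and prints command lines; it does not run them. It reuses exactly the flags shown above and varies the value passed to `-bs`, which is assumed here to be the batch size. The values 128 and 512 are illustrative choices, not settings taken from the repository.

```python
import shlex


def build_command(batch_size: int) -> list:
    """Build the example command as an argument list for one -bs value."""
    return [
        "python", "alexnet.py",
        "-model", "alexnet",
        "-bs", str(batch_size),  # assumed to mean batch size
        "-cluster", "clusters/dgx1_v100_2ib/n1_g1.json",
        "-ps", "dp",
        "--profile-iters", "50",
    ]


# Illustrative sweep; only 256 appears in the original example.
commands = [build_command(bs) for bs in (128, 256, 512)]
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell from the examples directory, or each argument list can be passed to `subprocess.run` to execute the sweep.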

Citation

@article{duan2023proteus,
  title={Proteus: Simulating the Performance of Distributed DNN Training},
  author={Duan, Jiangfei and Li, Xiuhong and Xu, Ping and Zhang, Xingcheng and Yan, Shengen and Liang, Yun and Lin, Dahua},
  journal={arXiv preprint arXiv:2306.02267},
  year={2023}
}