YukeWang96 / MGG_OSDI23

Artifact for OSDI'23: MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms.

Artifact for OSDI'23 paper

Yuke Wang, et al. Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms. OSDI'23.

[Paper] [Bibtex] DOI

1. Setup (skip to Section 2 if you are evaluating on the provided GCP instance)

1.1. Clone this project from GitHub.

git clone --recursive git@github.com:YukeWang96/MGG-OSDI23-AE.git

1.2. Download libraries and datasets.

wget https://proj-dat.s3.us-west-1.amazonaws.com/roc-new.tar.gz && tar -zxvf roc-new.tar.gz && rm roc-new.tar.gz

1.3. Launch Docker for MGG.

cd docker 

1.4. Compile the implementation.

mkdir build && cd build && cmake .. && cd ..

2. Run the initial test experiment.

3. Reproduce the major results from the paper.

3.1 Compare with UVM on 4xA100 and 8xA100 (Fig.8a and Fig.8b).


Note that the results can be found at Fig_8_UVM_MGG_4GPU_GCN.csv, Fig_8_UVM_MGG_4GPU_GIN.csv, Fig_8_UVM_MGG_8GPU_GCN.csv, and Fig_8_UVM_MGG_8GPU_GIN.csv.
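
To check which of these files a run actually produced, a small helper along the following lines may be convenient. It is only a sketch: the file names come from the note above, while the output directory (the current directory by default) and the helper names `read_rows` and `show_results` are assumptions made for illustration. Nothing is assumed about the column layout; rows are printed as-is.

```python
import csv
from pathlib import Path

# File names are taken from the note above. The directory they are written to
# is an assumption of this sketch (defaults to the current directory).
FIG8_FILES = [
    "Fig_8_UVM_MGG_4GPU_GCN.csv",
    "Fig_8_UVM_MGG_4GPU_GIN.csv",
    "Fig_8_UVM_MGG_8GPU_GCN.csv",
    "Fig_8_UVM_MGG_8GPU_GIN.csv",
]


def read_rows(path):
    """Return every non-empty row of a CSV file as a list of strings."""
    with open(path, newline="") as f:
        return [row for row in csv.reader(f) if row]


def show_results(directory=".", names=FIG8_FILES):
    """Print each result file that exists and return the names that are missing."""
    missing = []
    for name in names:
        path = Path(directory) / name
        if not path.is_file():
            missing.append(name)
            continue
        print(f"== {name} ==")
        for row in read_rows(path):
            print("  ".join(row))
    return missing


if __name__ == "__main__":
    for name in show_results():
        print(f"missing: {name}")
```

The same helper can be pointed at the result files of the later sections by passing a different `names` list.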

3.2 Compare with DGL on 8xA100 for GCN and GIN (Fig.7a and Fig.7b).

cd gcn/
cd ../gin/

Note that the DGL results can be found at 1_dgl_gin.csv and 1_dgl_gcn.csv; the MGG reference results are in MGG_GCN_8GPU.csv and MGG_8GPU_GIN.csv.

3.3 Compare with ROC on 8xA100 (Fig.9).

cd roc-new/docker

Note that the results can be found at Fig_9_ROC_MGG_8GPU_GCN.csv and Fig_9_ROC_MGG_8GPU_GIN.csv.

The ROC results should be similar to the following:

| Dataset | Time (ms) |
|---|---|
| reddit | 425.67 |
| enwiki-2013 | 619.33 |
| it-2004 | 5160.18 |
| paper100M | 8179.35 |
| ogbn-products | 529.74 |
| ogbn-proteins | 423.82 |
| com-orkut | 571.62 |

3.4 Compare NP with w/o NP (Fig.10a).

python 2_MGG_NP.py

Note that the results can be found at MGG_NP_study.csv. They should be similar to the following table.

| Dataset | MGG_WO_NP | MGG_W_NP | Speedup (x) |
|---|---|---|---|
| Reddit | 76.797 | 16.716 | 4.594 |
| enwiki-2013 | 290.169 | 88.249 | 3.288 |
| ogbn-product | 86.362 | 26.008 | 3.321 |
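
The Speedup column appears to be the time without NP divided by the time with NP. The sketch below recomputes it from the reference numbers above, which is a quick way to sanity-check your own MGG_NP_study.csv. The helper name `speedup` is illustrative and not part of the artifact.

```python
# Rows copied from the reference table above:
# (dataset, time without NP, time with NP).
ROWS = [
    ("Reddit", 76.797, 16.716),
    ("enwiki-2013", 290.169, 88.249),
    ("ogbn-product", 86.362, 26.008),
]


def speedup(baseline, optimized):
    """Speedup of the optimized run over the baseline (higher is better)."""
    return baseline / optimized


for name, wo_np, w_np in ROWS:
    print(f"{name}: {speedup(wo_np, w_np):.3f}x")
```

Rounded to three decimals, the computed values match the Speedup column of the table.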

3.5 Compare WL with w/o WL (Fig.10b).

python 3_MGG_WL.py

Note that the results can be found at MGG_WL_study.csv. They should be similar to the following table.

| Dataset | MGG_WO_NP | MGG_W_NP | Speedup (x) |
|---|---|---|---|
| Reddit | 75.035 | 18.92 | 3.966 |
| enwiki-2013 | 292.022 | 104.878 | 2.784 |
| ogbn-product | 86.632 | 29.941 | 2.893 |

3.6 Compare API (Fig.10c).

python 4_MGG_API.py

Note that the results can be found at MGG_API_study.csv. They should be similar to the following table.

| Norm.Time w.r.t. Thread | MGG_Thread | MGG_Warp | MGG_Block |
|---|---|---|---|
| Reddit | 1.0 | 0.299 | 0.295 |
| enwiki-2013 | 1.0 | 0.267 | 0.263 |
| ogbn-product | 1.0 | 0.310 | 0.317 |
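
Because every entry is a time normalized to the thread-level variant, the reciprocal of an entry gives that variant's speedup over MGG_Thread. The sketch below does this conversion for the reference numbers above and picks the fastest variant per dataset. The helper names `to_speedup` and `fastest_variant` are illustrative and not part of the artifact.

```python
# Normalized times copied from the reference table above
# (the thread-level variant is the baseline, hence 1.0).
NORM_TIME = {
    "Reddit": {"MGG_Thread": 1.0, "MGG_Warp": 0.299, "MGG_Block": 0.295},
    "enwiki-2013": {"MGG_Thread": 1.0, "MGG_Warp": 0.267, "MGG_Block": 0.263},
    "ogbn-product": {"MGG_Thread": 1.0, "MGG_Warp": 0.310, "MGG_Block": 0.317},
}


def to_speedup(norm_time):
    """Convert a time normalized to the baseline into a speedup over it."""
    return 1.0 / norm_time


def fastest_variant(row):
    """Return the variant with the smallest normalized time."""
    return min(row, key=row.get)


for dataset, row in NORM_TIME.items():
    best = fastest_variant(row)
    print(f"{dataset}: fastest={best}, speedup={to_speedup(row[best]):.2f}x")
```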

3.7 Design Space Search (Fig.11a)

python 5_MGG_DSE_4GPU.py

Note that the results can be found at Reddit_4xA100_dist_ps.csv and Reddit_4xA100_dist_wpb.csv. They should be similar to the following tables.

| dist\ps | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| 1 | 17.866 | 17.459 | 16.821 | 16.244 | 16.711 | 17.125 |
| 2 | 17.247 | 16.722 | 16.437 | 16.682 | 17.053 | 17.808 |
| 4 | 16.826 | 16.41 | 16.583 | 17.217 | 17.627 | 18.298 |
| 8 | 16.271 | 16.725 | 17.193 | 17.655 | 18.426 | 18.99 |
| 16 | 16.593 | 17.214 | 17.617 | 18.266 | 19.009 | 19.909 |

| dist\wpb | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| 1 | 34.773 | 23.164 | 16.576 | 15.235 | 16.519 |
| 2 | 34.599 | 23.557 | 17.254 | 15.981 | 19.56 |
| 4 | 34.835 | 23.616 | 17.674 | 17.034 | 22.084 |
| 8 | 34.729 | 23.817 | 18.302 | 18.708 | 25.656 |
| 16 | 34.803 | 24.161 | 18.879 | 23.44 | 32.978 |

python 5_MGG_DSE_8GPU.py

Note that the results can be found at Reddit_8xA100_dist_ps.csv and Reddit_8xA100_dist_wpb.csv.
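
Reading off the best configuration from such a grid is an argmin over all cells. The sketch below does this for the 4xA100 reference numbers listed earlier in this section, assuming each cell is a runtime where lower is better. The helper name `best_config` is illustrative and not part of the artifact.

```python
# Reference grids for Reddit on 4xA100, copied from the tables in this section.
# Keys are the dist values (rows); list entries follow the column order.
PS_VALUES = [1, 2, 4, 8, 16, 32]
DIST_PS = {
    1: [17.866, 17.459, 16.821, 16.244, 16.711, 17.125],
    2: [17.247, 16.722, 16.437, 16.682, 17.053, 17.808],
    4: [16.826, 16.41, 16.583, 17.217, 17.627, 18.298],
    8: [16.271, 16.725, 17.193, 17.655, 18.426, 18.99],
    16: [16.593, 17.214, 17.617, 18.266, 19.009, 19.909],
}

WPB_VALUES = [1, 2, 4, 8, 16]
DIST_WPB = {
    1: [34.773, 23.164, 16.576, 15.235, 16.519],
    2: [34.599, 23.557, 17.254, 15.981, 19.56],
    4: [34.835, 23.616, 17.674, 17.034, 22.084],
    8: [34.729, 23.817, 18.302, 18.708, 25.656],
    16: [34.803, 24.161, 18.879, 23.44, 32.978],
}


def best_config(grid, columns):
    """Return (dist, column value, time) of the smallest entry in the grid."""
    cells = (
        (dist, col, t)
        for dist, row in grid.items()
        for col, t in zip(columns, row)
    )
    return min(cells, key=lambda cell: cell[2])


print("dist/ps :", best_config(DIST_PS, PS_VALUES))
print("dist/wpb:", best_config(DIST_WPB, WPB_VALUES))
```

For the reference numbers, the smallest entries are at dist=1, ps=8 and at dist=1, wpb=8. Your own measurements may put the minimum in a neighboring cell.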

4. Use MGG as a Tool or Library for your project.

Building a new design based on MGG with NVSHMEM takes only a few steps:

4.1 Build the C++ design based on our existing examples.


4.2 Build the CUDA kernel design based on our existing examples.




4.3 Register the new design to CMake.



4.4 Launch the MGG docker and recompile.

4.5 Run the compiled executable.

