Yuke Wang, et al. MGG: Accelerating Graph Neural Networks with Fine-grained Intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms. OSDI'23.
```shell
git clone --recursive git@github.com:YukeWang96/MGG-OSDI23-AE.git
```
Download the libraries (cudnn-v8.2, nvshmem_src_2.0.3-0, openmpi-4.1.1) and the datasets.
```shell
wget https://proj-dat.s3.us-west-1.amazonaws.com/local.tar.gz
tar -zxvf local.tar.gz && rm local.tar.gz
tar -zxvf local/nvshmem_src_2.0.3-0/build_cu112.tar.gz
wget https://proj-dat.s3.us-west-1.amazonaws.com/dataset.tar.gz && tar -zxvf dataset.tar.gz && rm dataset.tar.gz
```
Setup baseline DGL
```shell
cd dgl_pydirect_internal
wget https://proj-dat.s3.us-west-1.amazonaws.com/graphdata.tar.gz && tar -zxvf graphdata.tar.gz && rm graphdata.tar.gz
cd ..
```
Setup baseline ROC
```shell
wget https://proj-dat.s3.us-west-1.amazonaws.com/roc-new.tar.gz && tar -zxvf roc-new.tar.gz && rm roc-new.tar.gz
```
```shell
cd docker
./launch.sh
mkdir build && cd build && cmake .. && cd ..
./0_mgg_build.sh
```
```shell
./0_run_MGG_UVM_4GPU_GCN.sh
./0_run_MGG_UVM_4GPU_GIN.sh
./0_run_MGG_UVM_8GPU_GCN.sh
./0_run_MGG_UVM_8GPU_GIN.sh
```
Note that the results can be found in `Fig_8_UVM_MGG_4GPU_GCN.csv`, `Fig_8_UVM_MGG_4GPU_GIN.csv`, `Fig_8_UVM_MGG_8GPU_GCN.csv`, and `Fig_8_UVM_MGG_8GPU_GIN.csv`.
```shell
./launch_docker.sh
cd gcn/
./0_run_gcn.sh
cd ../gin/
./0_run_gin.sh
```
Note that the results can be found in `1_dgl_gin.csv` and `1_dgl_gcn.csv`; our MGG references are in `MGG_GCN_8GPU.csv` and `MGG_8GPU_GIN.csv`.
```shell
cd roc-new/docker
./launch.sh
./run_all.sh
```
Note that the results can be found in `Fig_9_ROC_MGG_8GPU_GCN.csv` and `Fig_9_ROC_MGG_8GPU_GIN.csv`.
The results of ROC should be similar to the following:

| Dataset | Time (ms) |
|---|---|
| Reddit | 425.67 |
| enwiki-2013 | 619.33 |
| it-2004 | 5160.18 |
| paper100M | 8179.35 |
| ogbn-products | 529.74 |
| ogbn-proteins | 423.82 |
| com-orkut | 571.62 |
```shell
python 2_MGG_NP.py
```
Note that the results can be found in `MGG_NP_study.csv`, similar to the following table.
| Dataset | MGG_WO_NP | MGG_W_NP | Speedup (x) |
|---|---|---|---|
| Reddit | 76.797 | 16.716 | 4.594 |
| enwiki-2013 | 290.169 | 88.249 | 3.288 |
| ogbn-product | 86.362 | 26.008 | 3.321 |
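The Speedup column is simply the ratio of the two runtime columns. A quick sketch (values copied from the table above; not part of the AE scripts) to reproduce it:

```python
# Speedup (x) = MGG_WO_NP / MGG_W_NP, rounded to three decimals as in the CSV.
pairs = [
    (76.797, 16.716),    # first row
    (290.169, 88.249),   # enwiki-2013
    (86.362, 26.008),    # ogbn-product
]
speedups = [round(wo / w, 3) for wo, w in pairs]
print(speedups)  # [4.594, 3.288, 3.321]
```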
```shell
python 3_MGG_WL.py
```
Note that the results can be found in `MGG_WL_study.csv`, similar to the following table.
| Dataset | MGG_WO_WL | MGG_W_WL | Speedup (x) |
|---|---|---|---|
| Reddit | 75.035 | 18.92 | 3.966 |
| enwiki-2013 | 292.022 | 104.878 | 2.784 |
| ogbn-product | 86.632 | 29.941 | 2.893 |
```shell
python 4_MGG_API.py
```
Note that the results can be found in `MGG_API_study.csv`, similar to the following table.
| Dataset (norm. time w.r.t. MGG_Thread) | MGG_Thread | MGG_Warp | MGG_Block |
|---|---|---|---|
| Reddit | 1.0 | 0.299 | 0.295 |
| enwiki-2013 | 1.0 | 0.267 | 0.263 |
| ogbn-product | 1.0 | 0.310 | 0.317 |
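Since the times are normalized to MGG_Thread (= 1.0), taking the reciprocal gives the speedup of the warp- and block-level APIs over the thread-level API. A small sketch, using the first row's values from the table above:

```python
# Normalized time -> speedup over MGG_Thread: speedup = 1.0 / norm_time.
norm = {"MGG_Warp": 0.299, "MGG_Block": 0.295}  # first-row values
for api, t in norm.items():
    print(f"{api}: {1.0 / t:.2f}x over MGG_Thread")
```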
```shell
python 5_MGG_DSE_4GPU.py
```
Note that the results can be found in `Reddit_4xA100_dist_ps.csv` and `Reddit_4xA100_dist_wpb.csv`, similar to the tables below.

`Reddit_4xA100_dist_ps.csv`:
| dist \ ps | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| 1 | 17.866 | 17.459 | 16.821 | 16.244 | 16.711 | 17.125 |
| 2 | 17.247 | 16.722 | 16.437 | 16.682 | 17.053 | 17.808 |
| 4 | 16.826 | 16.41 | 16.583 | 17.217 | 17.627 | 18.298 |
| 8 | 16.271 | 16.725 | 17.193 | 17.655 | 18.426 | 18.99 |
| 16 | 16.593 | 17.214 | 17.617 | 18.266 | 19.009 | 19.909 |
`Reddit_4xA100_dist_wpb.csv`:
| dist \ wpb | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| 1 | 34.773 | 23.164 | 16.576 | 15.235 | 16.519 |
| 2 | 34.599 | 23.557 | 17.254 | 15.981 | 19.56 |
| 4 | 34.835 | 23.616 | 17.674 | 17.034 | 22.084 |
| 8 | 34.729 | 23.817 | 18.302 | 18.708 | 25.656 |
| 16 | 34.803 | 24.161 | 18.879 | 23.44 | 32.978 |
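A DSE sweep like this is typically post-processed by picking the fastest cell. The sketch below (not part of the AE scripts; grid values inlined from the `dist_wpb` table above rather than read from the CSV) finds the best (dist, wpb) pair:

```python
# Pick the best (dist, wpb) configuration from the Reddit 4xA100 wpb sweep.
wpb_vals = [1, 2, 4, 8, 16]
grid = {  # dist -> latency (ms) for each wpb setting
    1:  [34.773, 23.164, 16.576, 15.235, 16.519],
    2:  [34.599, 23.557, 17.254, 15.981, 19.560],
    4:  [34.835, 23.616, 17.674, 17.034, 22.084],
    8:  [34.729, 23.817, 18.302, 18.708, 25.656],
    16: [34.803, 24.161, 18.879, 23.440, 32.978],
}
best = min((t, d, w) for d, row in grid.items() for w, t in zip(wpb_vals, row))
print(f"best: {best[0]} ms at dist={best[1]}, wpb={best[2]}")  # 15.235 ms at dist=1, wpb=8
```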
```shell
python 5_MGG_DSE_8GPU.py
```
Note that the results can be found in `Reddit_8xA100_dist_ps.csv` and `Reddit_8xA100_dist_wpb.csv`.
Building a new design based on MGG with NVSHMEM is simple; there are only a few steps:

1. Create a new `.cu` file under `src/`.
2. Add the supporting kernels/functions to `include/neighbor_utils.cuh`.
3. Register the new `.cu` file in `CMakeLists.txt`.
4. Add a `make` command for the new `.cu` file in `0_mgg_build.sh`.
5. The compiled executable for the `.cu` file from step 1 can be found under `build/`.
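As a hypothetical illustration of steps 1, 3, and 4 above (the file name `my_gnn.cu` and the `add_executable` line are made up for this sketch; follow the existing entries in `CMakeLists.txt` and `0_mgg_build.sh` for the real pattern):

```python
import tempfile
from pathlib import Path

def scaffold_kernel(repo: Path, name: str) -> None:
    """Scaffold the manual steps for adding a new NVSHMEM-based kernel."""
    # Step 1: create a new .cu file under src/.
    (repo / "src").mkdir(parents=True, exist_ok=True)
    (repo / "src" / f"{name}.cu").write_text(
        '#include "neighbor_utils.cuh"\n\nint main() { /* launch the new kernel */ }\n')
    # Step 3: register the new .cu file in CMakeLists.txt (macro is illustrative).
    with (repo / "CMakeLists.txt").open("a") as f:
        f.write(f"add_executable({name} src/{name}.cu)\n")
    # Step 4: add a make command for the new target in 0_mgg_build.sh.
    with (repo / "0_mgg_build.sh").open("a") as f:
        f.write(f"make {name}\n")

# Demo on a throwaway directory standing in for the repo root.
repo = Path(tempfile.mkdtemp())
(repo / "CMakeLists.txt").write_text("")
(repo / "0_mgg_build.sh").write_text("#!/bin/bash\ncd build\n")
scaffold_kernel(repo, "my_gnn")
```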
```shell
cd docker
./launch.sh
cd build && cmake ..
cd .. && ./0_mgg_build.sh
```
- NVIDIA OpenSHMEM Library (NVSHMEM). https://docs.nvidia.com/nvshmem/api/index.html
- NVIDIA Unified Memory. https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
- NVIDIA Unified Virtual Memory. https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/
- NVIDIA cuBLAS. https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuBLAS/Level-3/gemm
- cuDNN Example for MNIST. https://github.com/haanjack/mnist-cudnn
- graph_project_start: Hang Liu. https://github.com/asherliu/graph_project_start.git
- Deep Graph Library: Wang, Minjie, et al. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. The International Conference on Learning Representations (ICLR'19).
- ROC: Jia, Zhihao, et al. Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with ROC. Proceedings of Machine Learning and Systems (MLSys'20).
- GNNAdvisor: Wang, Yuke, et al. GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs. 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI'21).
- GE-SpMM: Huang, Guyue, et al. GE-SpMM: General-Purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks. International Conference for High Performance Computing, Networking, Storage and Analysis (SC'20).
- Bit-Tensor-Core: Li, Ang, and Simon Su. Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs. IEEE Transactions on Parallel and Distributed Systems (TPDS'20).