alibaba / graphlearn-for-pytorch

A GPU-accelerated graph learning library for PyTorch, facilitating the scaling of GNN training and inference.
Apache License 2.0
113 stars 34 forks source link
deep-learning distributed gpu graph-neural-networks pytorch

GLT-pypi docs GLT CI License

GraphLearn-for-PyTorch(GLT) is a graph learning library for PyTorch that makes distributed GNN training and inference easy and efficient. It leverages the power of GPUs to accelerate graph sampling and utilizes UVA to reduce the conversion and copying of features of vertices and edges. For large-scale graphs, it supports distributed training on multiple GPUs or multiple machines through fast distributed sampling and feature lookup. Additionally, it provides flexible deployment for distributed training to meet different requirements.

Highlighted Features

Architecture Overview

The main goal of GLT is to leverage hardware resources like GPU/NVLink/RDMA and characteristics of GNN models to accelerate end-to-end GNN training in both the single-machine and distributed environments.

In the case of multi-GPU training, graph sampling and CPU-GPU data transferring could easily become the major performance bottleneck. To speed up graph sampling and feature lookup, GLT implements the Unified Tensor Storage to unify the memory management of CPU and GPU. Based on this storage, GLT supports both CPU-based and GPU-based graph operators such as neighbor sampling, negative sampling, feature lookup, subgraph sampling etc. To alleviate the CPU-GPU data transferring overheads incurred by feature collection, GLT supports caching features of hot vertices in GPU memory, and accessing the remaining feature data (stored in pinned memory) via UVA. We further utilize the high-speed NVLink between GPUs expand the capacity of GPU cache.

As for distributed training, to prevent remote data access from blocking the progress of model training, GLT implements an efficient RPC framework on top of PyTorch RPC and adopts asynchronous graph sampling and feature lookup operations to hide the network latency and boost the end-to-end training throughput.

To lower the learning curve for PyG users, the APIs of key abstractions in GLT, such as dataset and dataloader, are designed to be compatible with PyG. Thus PyG users can take full advantage of GLT's acceleration capabilities by only modifying very few lines of code.

For model training, GLT supports different models to fit different scales of real-world graphs. It allows users to collocate model training and graph sampling (including feature lookup) in the same process, or separate them into different processes or even different machines. We provide two example to illustrate the training process on small graphs: single GPU training example and multi-GPU training example. For large-scale graphs, GLT separates sampling and training processes for asynchronous and parallel acceleration, and supports deployment of sampling and training processes on the same or different machines. Examples of distributed training can be found in distributed examples.

Installation

Requirements

# glibc>=2.14, torch>=1.13
pip install graphlearn-torch

Build from source

Install Dependencies

git submodule update --init
sh install_dependencies.sh

Python

  1. Build
    python setup.py bdist_wheel
    pip install dist/*

Build in CPU-mode

WITH_CUDA=OFF python setup.py bdist_wheel
pip install dist/*
  1. UT
    sh scripts/run_python_ut.sh

C++

If you need to test C++ operations, you can only build the C++ part.

  1. Build
    cmake .
    make -j
  2. UT
    sh scripts/run_cpp_ut.sh

    Quick Tour

Accelarating PyG model training on a single GPU.

Let's take PyG's GraphSAGE on OGBN-Products as an example, you only need to replace PyG's torch_geometric.loader.NeighborSampler by the graphlearn_torch.loader.NeighborLoader to benefit from the the acceleration of model training using GLT.

import torch
import graphlearn_torch as glt
import os.path as osp

from ogb.nodeproppred import PygNodePropPredDataset

# PyG's original code preparing the ogbn-products dataset
root = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'products')
dataset = PygNodePropPredDataset('ogbn-products', root)
split_idx = dataset.get_idx_split()
data = dataset[0]

# Enable GLT acceleration on PyG requires only replacing
# PyG's NeighborSampler with the following code.
glt_dataset = glt.data.Dataset()
glt_dataset.build(edge_index=data.edge_index,
                  feature_data=data.x,
                  sort_func=glt.data.sort_by_in_degree,
                  split_ratio=0.2,
                  label=data.y,
                  device=0)
train_loader = glt.loader.NeighborLoader(glt_dataset,
                                         [15, 10, 5],
                                         split_idx['train'],
                                         batch_size=1024,
                                         shuffle=True,
                                         drop_last=True,
                                         as_pyg_v1=True)

The complete example can be found in examples/train_sage_ogbn_products.py.

While building the `glt_dataset`, the GPU where the graph sampling operations are performed is specified by parameter `device`. By default, the graph topology are stored in pinned memory for ZERO-COPY access. Users can also choose to stored the graph topology in GPU by specifying `graph_mode='CUDA` in [`graphlearn_torch.data.Dataset.build`](graphlearn_torch.data.Dataset.build). The `split_ratio` determines the fraction of feature data to be cached in GPU. By default, GLT sorts the vertices in descending order according to vertex indegree and selects vetices with higher indegree for feature caching. The default sort function used as the input parameter for [`graphlearn_torch.data.Dataset.build`](graphlearn_torch.data.Dataset.build) is [`graphlearn_torch.data.reorder.sort_by_in_degree`](graphlearn_torch.data.reorder.sort_by_in_degree). Users can also customize their own sort functions with compatible APIs.

Distributed training

For PyTorch DDP distributed training, there are usually several steps as follows:

First, load the graph and feature from partitions.

import torch
import os.path as osp
import graphlearn_torch as glt

# load from partitions and create distributed dataset.
# Partitions are generated by following script:
# `python partition_ogbn_dataset.py --dataset=ogbn-products --num_partitions=2`

root = osp.join(osp.dirname(osp.realpath(__file__)), '..', '..', 'data', 'products')
glt_dataset = glt.distributed.DistDataset()
glt_dataset.load(
  num_partitions=2,
  partition_idx=int(os.environ['RANK']),
  graph_dir=osp.join(root, 'ogbn-products-graph-partitions'),
  feature_dir=osp.join(root, 'ogbn-products-feature-partitions'),
  label_file=osp.join(root, 'ogbn-products-label', 'label.pt') # whole label
)
train_idx = torch.load(osp.join(root, 'ogbn-products-train-partitions',
                                'partition' + str(os.environ['RANK']) + '.pt'))

Second, create distributed neighbor loader based on the dataset above.

# distributed neighbor loader
train_loader = glt.distributed.DistNeighborLoader(
  data=glt_dataset,
  num_neighbors=[15, 10, 5],
  input_nodes=train_idx,
  batch_size=batch_size,
  drop_last=True,
  collect_features=True,
  to_device=torch.device(rank % torch.cuda.device_count()),
  worker_options=glt.distributed.MpDistSamplingWorkerOptions(
    num_workers=nsampling_proc_per_train,
    worker_devices=[torch.device('cuda', (i + rank) % torch.cuda.device_count())
                    for i in range(nsampling_proc_per_train)],
    worker_concurrency=4,
    master_addr='localhost',
    master_port=12345, # different from port in pytorch training.
    channel_size='2GB',
    pin_memory=True
  )
)

Finally, define DDP model and run.

from torch.nn.parallel import DistributedDataParallel
from torch_geometric.nn import GraphSAGE

# DDP model
model = GraphSAGE(
  in_channels=num_features,
  hidden_channels=256,
  num_layers=3,
  out_channels=num_classes,
).to(rank)
model = DistributedDataParallel(model, device_ids=[rank])
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# training.
for epoch in range(0, epochs):
  model.train()
  for batch in train_loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)[:batch.batch_size].log_softmax(dim=-1)
    loss = F.nll_loss(out, batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()
  dist.barrier()

The training scripts for 2 nodes each with 2 GPUs are as follows:

# node 0:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --use_env --nnodes=2 --node_rank=0 --master_addr=xxx dist_train_sage_supervised.py

# node 1:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --use_env --nnodes=2 --node_rank=1 --master_addr=xxx dist_train_sage_supervised.py

Full code can be found in distributed training example.

License

Apache License 2.0