dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[RFC] Support heterogeneous graphs in distributed training #2436

Closed: zheng-da closed this issue 3 years ago

zheng-da commented 3 years ago

Motivation

Starting from version 0.5, DGL supports distributed training. The detailed design of distributed training support in DGL can be found here. In short, DGL splits a large graph into multiple partitions with METIS, and each machine is responsible for one partition. DGL partitions the node/edge attributes of the graph according to the METIS partitioning as well. DGL provides a distributed KVStore to serve node/edge attributes in the distributed setting; it offers a pull API to read node/edge attributes and a push API to update node embeddings. The keys for accessing the attributes are node Ids and edge Ids. DGL provides DistGraph as a unified interface to access partitioned graph data in the cluster. The main functions provided by DistGraph are sampling the neighbors of seed nodes (sample_neighbors) and accessing node/edge data (ndata and edata).
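For concreteness, here is a minimal sketch of the current homogeneous-graph API described above, assuming a graph has already been partitioned under the name 'graph_name' and an ip_config.txt listing the machines in the cluster:

```python
import dgl
import torch

# Connect this trainer to the cluster (assumes 'ip_config.txt' lists the
# machines and the graph was partitioned with dgl.distributed.partition_graph).
dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('graph_name')

# Sample the neighborhood of some seed nodes; the requests are served by
# the machines that own the corresponding partitions.
seeds = torch.tensor([0, 1, 2])
sg = dgl.distributed.sample_neighbors(g, seeds, fanout=10)

# Read node features; this is a KVStore "pull" keyed by node Ids.
feat = g.ndata['feat'][seeds]
```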

Currently, DGL only supports homogeneous graphs in distributed training. However, many industrial graphs are heterogeneous: nodes/edges of different types have different feature dimensions, and some nodes/edges may have no features at all. Unfortunately, we cannot simply extend the current heterogeneous graph design to the distributed setting, for performance reasons. The current design uses a separate CSR for each relation and invokes the graph kernel on each relation separately. This does not cause much overhead in full-batch/full-graph training, but it makes mini-batch computation inefficient: a mini-batch is usually small, and splitting its small graph structure into even smaller per-relation pieces and invoking the graph kernels on them separately usually results in suboptimal performance (any Python overhead and parallelization overhead become more noticeable). In the context of distributed mini-batch training, the current graph structure design would also generate many small sampling requests to the cluster of machines.

Proposal

The main proposal is to define a graph structure format that works well for distributed heterogeneous graphs to address the problems above. We use a homogeneous graph structure to support distributed sampling and revise the graph partition algorithm to convert any heterogeneous graph into this proposed format. We revise the distributed version of sample_neighbors to run on the proposed graph structure and allow it to output the sampled results in either a homogeneous or a heterogeneous graph format. We extend the KVStore to support the storage of node/edge attributes of different types.

Graph structure

We will use a homogeneous graph format (e.g., a single CSR) to store the heterogeneous graph structure, enabling efficient distributed sampling and mini-batch computation. In this graph structure, we need to identify the type and the per-type Id of each node and edge (per-type Id means that each node/edge type has its own contiguous Id space starting from 0). There are three options for storing node/edge types and per-type Ids for distributed sampling.

(Figures dist_hetero1 and dist_hetero3: analysis of the three storage options.)

Given this analysis, option 3 is likely the best option if we don't redesign the entire data structure for heterogeneous graphs.
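Although the figures with the option details are not reproduced here, the chosen direction (keeping one homogeneous structure and carrying the type and per-type Id with each node/edge) is close to what dgl.to_homogeneous produces today. A small sketch (not necessarily the exact in-memory representation the RFC would adopt):

```python
import dgl
import torch

# A tiny heterogeneous graph with one relation between 'user' and 'item'.
hg = dgl.heterograph({
    ('user', 'clicks', 'item'): (torch.tensor([0, 1]), torch.tensor([0, 2])),
})

# Flatten it into a single homogeneous graph: one structure (e.g., one CSR)
# can then serve sampling requests for all relations at once.
g = dgl.to_homogeneous(hg)

# Each node/edge carries its type and its per-type Id as plain tensor data.
print(g.ndata[dgl.NTYPE], g.ndata[dgl.NID])
print(g.edata[dgl.ETYPE], g.edata[dgl.EID])
```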

Regardless of the data structure we choose, it should be hidden from users; this way, we can switch to another data structure without changing users' code. This is possible because only the methods in DistGraph and sample_neighbors access the data structure directly. When sample_neighbors returns the sampled results, we should give users an option to choose the format. There are two formats to choose from: the current heterogeneous graph format in DGL, and the homogeneous graph format with additional metadata (node/edge type and per-type node/edge Ids) stored as node/edge data.

Distributed sampling

If we simply apply the same sampling algorithm (sampling a fixed number of neighbors from the neighborhood of a vertex regardless of relation type), we only need to convert the sampled results into a homogeneous or a heterogeneous graph format.
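With the type and per-type-Id annotations described above, that conversion is mostly bookkeeping; for instance, dgl.to_heterogeneous can rebuild the typed view from an annotated homogeneous graph. A sketch, using a flattened toy graph as a stand-in for a sampled subgraph:

```python
import dgl
import torch

hg = dgl.heterograph({
    ('user', 'clicks', 'item'): (torch.tensor([0, 1]), torch.tensor([0, 2])),
})
sg = dgl.to_homogeneous(hg)  # stand-in for a sampled homogeneous subgraph

# Homogeneous output: hand `sg` to the model directly (the dgl.NTYPE/ETYPE
# annotations are enough for, e.g., an RGCN-style kernel).
# Heterogeneous output: rebuild the typed view from the annotations.
hetero_sg = dgl.to_heterogeneous(sg, hg.ntypes, hg.etypes)
```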

A user may instead want to sample a fixed number of neighbors for each relation type. In this case, we need to reimplement the sampling algorithm on this graph format, as sketched below. We can postpone this functionality to the next release.
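A hypothetical helper showing what per-relation sampling on the homogeneous format would have to do for a single seed node (the names are illustrative, not a committed API):

```python
import torch

def sample_per_relation(neighbor_eids, neighbor_etypes, fanout_per_type):
    """Pick at most fanout_per_type[t] of a seed node's incident edges for
    every relation type t, given the edge Ids and edge-type Ids that the
    homogeneous CSR stores for that node's neighborhood."""
    picked = []
    for etype, fanout in fanout_per_type.items():
        candidates = neighbor_eids[neighbor_etypes == etype]
        if len(candidates) > fanout:
            candidates = candidates[torch.randperm(len(candidates))[:fanout]]
        picked.append(candidates)
    return torch.cat(picked)

# Example: 6 incident edges of three types; sample 2 of type 0, 1 of type 2.
eids = torch.arange(6)
etypes = torch.tensor([0, 0, 0, 1, 2, 2])
print(sample_per_relation(eids, etypes, {0: 2, 2: 1}))
```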

Storage of node/edge attributes

We store node/edge attributes in the KVStore. Currently, the KVStore supports two Id spaces: the node Id space and the edge Id space; one stores node data and the other stores edge data. To support heterogeneous graphs, we need to extend the KVStore to support an arbitrary number of Id spaces, created when the KVStore servers are launched. This extension is fairly simple because the current KVStore already supports PartitionPolicy, which defines the partition policy of each Id space, and each tensor stored in the KVStore is associated with a partition policy.
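A sketch of what the extended storage interface could look like, with one partition policy (and hence one Id space) per node type; the exact names here illustrate the proposal rather than an API that existed at the time of this RFC:

```python
import dgl
import torch

g = dgl.distributed.DistGraph('graph_name')  # assumes a partitioned graph

# One Id space per node type lets 'user' and 'item' features have different
# dimensions (or be absent entirely for some types).
user_feat = dgl.distributed.DistTensor(
    (g.num_nodes('user'), 128), torch.float32, name='user_feat',
    part_policy=g.get_node_partition_policy('user'))
item_feat = dgl.distributed.DistTensor(
    (g.num_nodes('item'), 64), torch.float32, name='item_feat',
    part_policy=g.get_node_partition_policy('item'))

# Reads and writes are KVStore pulls/pushes keyed by per-type node Ids.
vals = user_feat[torch.tensor([0, 1, 2])]
```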

Programming interface

Even though we internally store the heterogeneous graph structure in a homogeneous graph format, we only expose a heterogeneous graph interface to users. That is, DistGraph will support the heterogeneous graph operations of DGLGraph. A node/edge in DistGraph is identified by a pair of node/edge type and per-type node/edge Id. If a mini-batch uses the heterogeneous graph format, its nodes/edges are also identified by such pairs; if it uses the homogeneous graph format, a node/edge is identified by its local Id. Regardless of the format, every node/edge in a mini-batch is associated with a pair of node/edge type and per-type Id that refers back to the node/edge in the global graph.
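Put together, user code would look like ordinary heterogeneous DGLGraph code. A sketch of the proposed interface, with hypothetical node types 'user'/'item':

```python
import dgl
import torch

g = dgl.distributed.DistGraph('graph_name')  # assumes a partitioned graph

# Nodes are addressed by (node type, per-type Id), exactly as in DGLGraph.
user_feat = g.nodes['user'].data['feat'][torch.tensor([0, 1, 2])]

# Seeds are given per type; per the RFC, the returned mini-batch can be in
# either the heterogeneous format or the annotated homogeneous format.
seeds = {'user': torch.tensor([0, 1, 2])}
sg = dgl.distributed.sample_neighbors(g, seeds, fanout=10)
```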

Major works

To support heterogeneous graphs in distributed training as proposed, we need to modify a few components: the graph partition algorithm, the distributed sample_neighbors, the KVStore, and the DistGraph interface.

BarclayII commented 3 years ago

My gut feeling is that, since this is related to the heterogeneous graph format design (option 1 you mentioned), you could write down the interfaces that distributed support needs the heterogeneous graph data structure to implement efficiently. Later on, when we revisit the heterogeneous graph design for cross-type aggregation, we can incorporate the changes relatively seamlessly.

One thing that may potentially be an issue: since you will be implementing the distributed heterogeneous graph with a homogeneous graph (if I understand you correctly), when I change the heterogeneous graph data structure, would I need to completely overhaul the distributed heterogeneous graph code?

zheng-da commented 3 years ago

The sampling algorithm can output a mini-batch in either a heterogeneous or a homogeneous graph format. Right now, the homogeneous graph format supports more efficient RGCN mini-batch computation. When we have a more efficient heterogeneous graph format, we can switch to the new format. Assuming the heterogeneous graph API doesn't change, this shouldn't impact users' code, as long as users specify which output format they want for mini-batches.