dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[RFC] Support heterogeneous graphs in distributed training #2436

Closed: zheng-da closed this issue 3 years ago

zheng-da commented 3 years ago

Motivation

Starting from version 0.5, DGL supports distributed training. The detailed design of distributed training support in DGL can be found here. In short, DGL splits a large graph into multiple partitions with METIS, and each machine is responsible for one partition. DGL partitions the node/edge attributes of the graph according to the METIS partitioning as well. DGL provides a distributed KVStore to serve node/edge attributes in the distributed setting; it offers a pull API to read node/edge attributes and a push API to update node embeddings. The keys for accessing the attributes are node Ids and edge Ids. DGL provides DistGraph as a unified interface to access partitioned graph data in the cluster. The main functions provided by DistGraph are sampling the neighbors of seed nodes (sample_neighbors) and accessing node/edge data (ndata and edata).
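For concreteness, here is a minimal sketch of the current homogeneous-graph API described above, assuming a graph has already been partitioned under the name 'graph_name' and an ip_config.txt listing the machines in the cluster:

```python
import dgl
import torch

# Connect this trainer to the cluster (assumes 'ip_config.txt' lists the
# machines and the graph was partitioned with dgl.distributed.partition_graph).
dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('graph_name')

# Sample the neighborhood of some seed nodes; the requests are served by
# the machines that own the corresponding partitions.
seeds = torch.tensor([0, 1, 2])
sg = dgl.distributed.sample_neighbors(g, seeds, fanout=10)

# Read node features; this is a KVStore "pull" keyed by node Ids.
feat = g.ndata['feat'][seeds]
```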

Currently, DGL only supports homogeneous graphs in distributed training. However, many industrial graphs are heterogeneous: nodes/edges of different types have different feature dimensions, and some nodes/edges may have no features at all. Unfortunately, we cannot simply extend the current heterogeneous graph design to the distributed setting, for performance reasons. The current design uses a separate CSR for each relation and invokes the graph kernel on each relation separately. This does not cause much overhead in full-batch/full-graph training, but it makes mini-batch computation inefficient: a mini-batch is usually small, and splitting its small graph structure into even smaller per-relation pieces and invoking the graph kernels on them separately usually results in suboptimal performance (any Python overhead and parallelization overhead become more noticeable). In the context of distributed mini-batch training, the current graph structure design would also generate many small sampling requests to the cluster of machines.

Proposal

The main proposal is to define a graph structure format that works well for distributed heterogeneous graphs to address the problems above. We use a homogeneous graph structure to support distributed sampling and revise the graph partition algorithm to convert any heterogeneous graph into this proposed format. We revise the distributed version of sample_neighbors to run on the proposed graph structure and allow it to output the sampled results in either a homogeneous or a heterogeneous graph format. We extend the KVStore to support the storage of node/edge attributes of different types.

Graph structure

We will use a homogeneous graph format (e.g., a single CSR) to store the heterogeneous graph structure, enabling efficient distributed sampling and mini-batch computation. In this graph structure, we need to identify the type and the per-type Id of each node and edge (per-type Id means that each node/edge type has its own contiguous Id space starting from 0). There are three options for storing node/edge types and per-type Ids for distributed sampling.

(Figures dist_hetero1 and dist_hetero3: analysis of the three storage options.)

Given this analysis, option 3 is likely the best option if we don't redesign the entire data structure for heterogeneous graphs.
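Although the figures with the option details are not reproduced here, the chosen direction (keeping one homogeneous structure and carrying the type and per-type Id with each node/edge) is close to what dgl.to_homogeneous produces today. A small sketch (not necessarily the exact in-memory representation the RFC would adopt):

```python
import dgl
import torch

# A tiny heterogeneous graph with one relation between 'user' and 'item'.
hg = dgl.heterograph({
    ('user', 'clicks', 'item'): (torch.tensor([0, 1]), torch.tensor([0, 2])),
})

# Flatten it into a single homogeneous graph: one structure (e.g., one CSR)
# can then serve sampling requests for all relations at once.
g = dgl.to_homogeneous(hg)

# Each node/edge carries its type and its per-type Id as plain tensor data.
print(g.ndata[dgl.NTYPE], g.ndata[dgl.NID])
print(g.edata[dgl.ETYPE], g.edata[dgl.EID])
```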

Regardless of the data structure we choose, it should be hidden from users; this way, we can switch to another data structure without changing users' code. This is possible because only the methods in DistGraph and sample_neighbors access the data structure directly. When sample_neighbors returns the sampled results, we should give users an option to choose the format. There are two formats to choose from: the current heterogeneous graph format in DGL, and the homogeneous graph format with additional metadata (node/edge type and per-type node/edge Ids) stored as node/edge data.

Distributed sampling

If we simply apply the same sampling algorithm (sampling a fixed number of neighbors from the neighborhood of a vertex regardless of relation type), we only need to convert the sampled results into a homogeneous or a heterogeneous graph format.
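With the type and per-type-Id annotations described above, that conversion is mostly bookkeeping; for instance, dgl.to_heterogeneous can rebuild the typed view from an annotated homogeneous graph. A sketch, using a flattened toy graph as a stand-in for a sampled subgraph:

```python
import dgl
import torch

hg = dgl.heterograph({
    ('user', 'clicks', 'item'): (torch.tensor([0, 1]), torch.tensor([0, 2])),
})
sg = dgl.to_homogeneous(hg)  # stand-in for a sampled homogeneous subgraph

# Homogeneous output: hand `sg` to the model directly (the dgl.NTYPE/ETYPE
# annotations are enough for, e.g., an RGCN-style kernel).
# Heterogeneous output: rebuild the typed view from the annotations.
hetero_sg = dgl.to_heterogeneous(sg, hg.ntypes, hg.etypes)
```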

A user may instead want to sample a fixed number of neighbors for each relation type. In this case, we need to reimplement the sampling algorithm on this graph format, as sketched below. We can postpone this functionality to the next release.
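A hypothetical helper showing what per-relation sampling on the homogeneous format would have to do for a single seed node (the names are illustrative, not a committed API):

```python
import torch

def sample_per_relation(neighbor_eids, neighbor_etypes, fanout_per_type):
    """Pick at most fanout_per_type[t] of a seed node's incident edges for
    every relation type t, given the edge Ids and edge-type Ids that the
    homogeneous CSR stores for that node's neighborhood."""
    picked = []
    for etype, fanout in fanout_per_type.items():
        candidates = neighbor_eids[neighbor_etypes == etype]
        if len(candidates) > fanout:
            candidates = candidates[torch.randperm(len(candidates))[:fanout]]
        picked.append(candidates)
    return torch.cat(picked)

# Example: 6 incident edges of three types; sample 2 of type 0, 1 of type 2.
eids = torch.arange(6)
etypes = torch.tensor([0, 0, 0, 1, 2, 2])
print(sample_per_relation(eids, etypes, {0: 2, 2: 1}))
```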

Storage of node/edge attributes

We store node/edge attributes in the KVStore. Currently, the KVStore supports two Id spaces: the node Id space and the edge Id space; one stores node data and the other stores edge data. To support heterogeneous graphs, we need to extend the KVStore to support an arbitrary number of Id spaces, created when the KVStore servers are launched. This extension is fairly simple because the current KVStore already supports PartitionPolicy, which defines the partition policy of each Id space, and each tensor stored in the KVStore is associated with a partition policy.
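A sketch of what the extended storage interface could look like, with one partition policy (and hence one Id space) per node type; the exact names here illustrate the proposal rather than an API that existed at the time of this RFC:

```python
import dgl
import torch

g = dgl.distributed.DistGraph('graph_name')  # assumes a partitioned graph

# One Id space per node type lets 'user' and 'item' features have different
# dimensions (or be absent entirely for some types).
user_feat = dgl.distributed.DistTensor(
    (g.num_nodes('user'), 128), torch.float32, name='user_feat',
    part_policy=g.get_node_partition_policy('user'))
item_feat = dgl.distributed.DistTensor(
    (g.num_nodes('item'), 64), torch.float32, name='item_feat',
    part_policy=g.get_node_partition_policy('item'))

# Reads and writes are KVStore pulls/pushes keyed by per-type node Ids.
vals = user_feat[torch.tensor([0, 1, 2])]
```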

Programming interface

Even though we internally store the heterogeneous graph structure in a homogeneous graph format, we only expose a heterogeneous graph interface to users. That is, DistGraph will support the heterogeneous graph operations of DGLGraph. A node/edge in DistGraph is identified by a pair of node/edge type and per-type node/edge Id. If a mini-batch uses the heterogeneous graph format, its nodes/edges are also identified by such pairs; if it uses the homogeneous graph format, a node/edge is identified by its local Id. Regardless of the format, every node/edge in a mini-batch is associated with a pair of node/edge type and per-type Id that refers back to the node/edge in the global graph.
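Put together, user code would look like ordinary heterogeneous DGLGraph code. A sketch of the proposed interface, with hypothetical node types 'user'/'item':

```python
import dgl
import torch

g = dgl.distributed.DistGraph('graph_name')  # assumes a partitioned graph

# Nodes are addressed by (node type, per-type Id), exactly as in DGLGraph.
user_feat = g.nodes['user'].data['feat'][torch.tensor([0, 1, 2])]

# Seeds are given per type; per the RFC, the returned mini-batch can be in
# either the heterogeneous format or the annotated homogeneous format.
seeds = {'user': torch.tensor([0, 1, 2])}
sg = dgl.distributed.sample_neighbors(g, seeds, fanout=10)
```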

Major works

To support heterogeneous graphs in distributed training as proposed, we need to modify a few components: the graph partition algorithm, the distributed sample_neighbors, the KVStore, and the DistGraph interface.

BarclayII commented 3 years ago

My gut feeling is that, since this is related to the heterogeneous graph format design (option 1 you mentioned), you could write down the interfaces that distributed support needs the heterogeneous graph data structure to implement efficiently. Later on, when we revisit the heterogeneous graph design for cross-type aggregation, we can incorporate the changes relatively seamlessly.

One thing that may potentially be an issue: since you will be implementing the distributed heterogeneous graph with a homogeneous graph (if I understand you correctly), when I change the heterogeneous graph data structure, would I need to completely overhaul the distributed heterogeneous graph code?

zheng-da commented 3 years ago

The sampling algorithm can output a mini-batch in either a heterogeneous or a homogeneous graph format. Right now, the homogeneous graph format supports more efficient RGCN mini-batch computation. When we have a more efficient heterogeneous graph format, we can switch to the new format. Assuming the heterogeneous graph API doesn't change, this shouldn't impact users' code, as long as users specify which output format they want for mini-batches.