dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

The partition strategy of DistDGL seems like a vertex cut, different from what its paper demonstrates #2773

Closed Zarca closed 3 years ago

Zarca commented 3 years ago

Hi, after checking each distributed graph partition of DGL (0.6), I believe that it adopts a vertex-cut strategy: each edge is assigned to exactly one partition, while some vertices may be replicated. However, when I looked into the official paper (figure 4b: https://arxiv.org/abs/2010.05337), I found it says that DGL uses an edge-cut strategy, which is far from the vertex-cut behavior I observed. Any explanation?

BarclayII commented 3 years ago

We used METIS partitioning, which minimizes the edge cut between partitions.

The duplicated vertices are actually halo nodes, and vertex feature data are not duplicated. Therefore it is not a vertex cut.

Please follow up if you have further questions. Thanks!

Zarca commented 3 years ago

Thanks for the info, but I noticed that the edge sets of the subgraphs have no intersection, and the sum of the subgraphs' edge counts equals the full graph's edge count. If it were an edge cut, I would expect some edge replicas spanning different partitions; however, as I mentioned, the subgraphs do not have any edge replication. That is the biggest thing confusing me.

Zarca commented 3 years ago

For instance, I use DGL to partition a graph into 2 partitions. The edges of the full graph are (0,1),(0,2),(0,3),(0,4),(0,5). After the cut, part_0 contains (0,1),(0,3),(0,4) and part_1 contains (0,2),(0,5). I think this looks extremely like vertex-cut behavior. A sketch reproducing this setup is below.
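For reference, a minimal sketch of that setup using `dgl.distributed.partition_graph`; the graph construction, graph name, and output path here are illustrative, and the exact partition assignment depends on METIS:

```python
import dgl
import torch

# Star graph from the example: vertex 0 points to vertices 1..5.
src = torch.tensor([0, 0, 0, 0, 0])
dst = torch.tensor([1, 2, 3, 4, 5])
g = dgl.graph((src, dst))

# Partition into 2 parts with METIS (the default part_method); this writes
# part_0/ and part_1/ plus a star.json config file under star_parts/.
dgl.distributed.partition_graph(g, graph_name='star', num_parts=2,
                                out_path='star_parts')
```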

jermainewang commented 3 years ago

Hi Zarca, we follow the definitions of vertex-cut and edge-cut in the GraphLab/PowerGraph paper. Below is a figure from the PowerGraph paper.

[Figure from the PowerGraph paper: edge-cut partitioning duplicates cut edges and creates ghost vertices, while vertex-cut stores each edge once and creates master/mirror copies of cut vertices.]

You can see that in an edge-cut, cut edges have ghost vertices and are stored on two partitions. By contrast, a vertex-cut does not duplicate edge storage (each edge is stored in exactly one partition) but creates master/mirror copies for cut vertices.

So what does DistDGL do?

In your example, the graph is (0,1),(0,2),(0,3),(0,4),(0,5). Suppose METIS gives an assignment: vertices 0, 1, 2 to partition 0 and vertices 3, 4, 5 to partition 1. Then the edge assignment will be: (0,1),(1,0),(0,2),(2,0),(3,0),(4,0),(5,0) to partition 0 and (0,3),(0,4),(0,5) to partition 1. This makes vertices 3, 4, 5 the ghost (halo) vertices on partition 0 and vertex 0 the ghost (halo) vertex on partition 1. For vertex data, to emphasize once again, they are stored according to the METIS assignment and thus are not duplicated. For edge data, DistDGL duplicates edges when converting the graph to a bi-directed graph, and it may store an edge in two partitions if the edge crosses partitions. These characteristics make it more edge-cut than vertex-cut.
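One can check this on disk; a minimal sketch, assuming the partitions from the earlier snippet were written under `star_parts/` (the length of the tuple returned by `load_partition` varies across DGL versions, so only the first element, the local graph, is used here):

```python
import dgl

# Load partition 0; load_partition returns the local graph first,
# followed by feature dicts and the partition book.
part0 = dgl.distributed.load_partition('star_parts/star.json', 0)[0]

# 'inner_node' marks vertices owned by this partition (1) vs. halo copies (0),
# and dgl.NID maps local node IDs back to IDs in the original graph.
owned = part0.ndata['inner_node'].bool()
print('owned vertices:', part0.ndata[dgl.NID][owned].tolist())
print('halo vertices :', part0.ndata[dgl.NID][~owned].tolist())
```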

Zarca commented 3 years ago

Hi @jermainewang, thanks for the info. So I think it is a mixed partition, combining the style of edge cut (what METIS does) with the style of vertex cut, is that true? But METIS's default node assignment seems to be unbalanced when taking the training workload into consideration. Could you please explain what improvements DistDGL has made on top of the default METIS? The official paper (https://arxiv.org/abs/2010.05337) mentions that it adds some constraints over the original METIS partition. I still want to figure out the details: does METIS apply an even node assignment, and does DistDGL take more things into consideration, such as distributing the training nodes evenly?

zheng-da commented 3 years ago

I think DGL uses an edge cut. However, an edge only appears in one of the partitions. This works OK on a directed graph: an edge always belongs to its destination node. Unfortunately, this leads to some problems; for example, we cannot compute the out-degrees of nodes correctly. We'll change it to strictly follow the edge cut illustrated in the figure above.

METIS balances the number of nodes in each partition by default. If the training set is a subset of the nodes, this default balancing strategy isn't enough: it doesn't know which nodes need to be balanced. Therefore, we allow users to specify a type for each node in the graph and balance the nodes of each type. For example, if we want to balance the number of training nodes, we can mark training nodes as one type and non-training nodes as another. In this way, the partitioner knows how to balance the training nodes, as sketched below.
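A minimal sketch of this using the `balance_ntypes` argument of `dgl.distributed.partition_graph`; the random graph, mask sizes, and output path are illustrative:

```python
import dgl
import torch

# Illustrative graph whose first 100 nodes form the training set.
g = dgl.rand_graph(1000, 5000)
g.ndata['train_mask'] = torch.zeros(g.num_nodes(), dtype=torch.bool)
g.ndata['train_mask'][:100] = True

# Passing the mask as balance_ntypes treats training and non-training
# nodes as two node types, so the partitioner balances both counts per
# partition in addition to METIS's default node balancing.
dgl.distributed.partition_graph(g, graph_name='rand', num_parts=4,
                                out_path='rand_parts',
                                balance_ntypes=g.ndata['train_mask'])
```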

Zarca commented 3 years ago

Oh, now I see the idea behind DistDGL, since a directed edge can be kept whole instead of being cut. As far as I know, METIS can take many strategies in its 3 phases: coarsening, partitioning, and uncoarsening. For instance, the first phase (coarsening) can use RM, HEM, LEM, or HCM, and the other 2 phases also have many choices, just as the paper referenced by DistDGL says: https://www.researchgate.net/publication/242479489_Kumar_V_A_Fast_and_High_Quality_Multilevel_Scheme_for_Partitioning_Irregular_Graphs_SIAM_Journal_on_Scientific_Computing_201_359-392

Could you please tell me which strategies the METIS inside DistDGL chooses if I run a default METIS partition in DistDGL (no train-balance or other flags on)?

zheng-da commented 3 years ago

I'm not sure which strategies METIS uses for each phase. We just use the default strategies, except that we add constraints to balance the partitions.

Zarca commented 3 years ago

So the whole thing is that DistDGL runs the original METIS to get a partition, and after that, DGL rebalances it based on the training nodes. So they are just 2 separate phases in sequence, is that true? What I meant is that DistDGL did not modify the original METIS itself.

zheng-da commented 3 years ago

DistDGL doesn't modify the original METIS strategies. At runtime, we may adjust the training nodes to make the load more balanced.