(GraphX): better partitioning strategies

cloudml / zen

Zen aims to provide the largest scale and the most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent dirichilet allocation, factorization machines and DNN.

Apache License 2.0

170 stars 75 forks source link

(GraphX): better partitioning strategies #38

Open hucheng opened 9 years ago

hucheng commented 9 years ago

There are four partitioning strategies in GraphX:

random hash
edge1D (src or dst)
edgePartition2D

Besides, we also implemented:

DBH (Degree-Based Hashing)
balanced label propagation from Facebook (http://stanford.edu/~jugander/papers/wsdm13-blp.pdf and https://code.facebook.com/posts/274771932683700/large-scale-graph-partitioning-with-apache-giraph/)
Bounded and Balanced Partitioner (two stages, edges belongs to vertex partition that has larger degree, and a re-balanced partitioner, details later. )

bhoppi commented 9 years ago

BBR has been tested. Experimental result suggests that BBR is slower than DBH when both partitioners work, but when the topic number is large or it has too many partitions, DBH can't work while BBR works well. Label propagation partitioner has basically close performance with BBR, but it has two weaknesses: 1. It is an iterative algorithm so it need re-partition again and again so it is slow; 2. It needs transfer a P * P matrix in each iteration which P is the partition number. When P is large, the rdd.collect() call can be very slow and cause the driver OOM.

hucheng commented 9 years ago

Good.

benmccann commented 8 years ago

Can you make these part of PartitionStrategy instead of implementing in a separate package? Seems like they belong in the graphx2 package...