cloudml / zen

Zen aims to provide the largest-scale and most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines and DNN.
Apache License 2.0
170 stars 75 forks

(LDA) Multi-thread GraphX implementation #41

Closed bhoppi closed 8 years ago

bhoppi commented 9 years ago

The current GraphX implementation has serious scalability issues. The reason is that its RDD data structure is specially optimized for join operations (edges joined with vertices, inner edge joins, outer vertex joins, etc.). As a result, its data is not loaded block by block like other RDDs; each partition is loaded as a whole, so OOM errors occur often when many partitions are loaded at the same time. Our solution is to limit the number of partitions loaded at the same time and to process each partition with multiple threads.
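To make the idea concrete, here is a minimal sketch of the control flow in plain Python. It is not GraphX or Zen code. The names `load_partition`, `process_chunk`, `MAX_LOADED_PARTITIONS` and `THREADS_PER_PARTITION` are hypothetical, and the per-edge work is a placeholder. A semaphore caps how many partitions are resident in memory at once, and each resident partition is split into chunks that a small thread pool works through.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_LOADED_PARTITIONS = 2   # hypothetical cap on partitions resident at once
THREADS_PER_PARTITION = 4   # hypothetical worker threads inside one partition


def load_partition(pid, size=1000):
    """Stand-in for materializing a whole edge partition in memory."""
    return [(pid, i) for i in range(size)]


def process_chunk(chunk):
    """Stand-in for per-edge work (placeholder: sum the edge ids)."""
    return sum(i for _, i in chunk)


def process_partition(edges, num_threads=THREADS_PER_PARTITION):
    """Split one loaded partition into chunks and process them with threads."""
    step = max(1, -(-len(edges) // num_threads))  # ceiling division
    chunks = [edges[i:i + step] for i in range(0, len(edges), step)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return sum(pool.map(process_chunk, chunks))


def run(num_partitions, max_loaded=MAX_LOADED_PARTITIONS):
    """Process all partitions, keeping at most max_loaded in memory."""
    gate = threading.BoundedSemaphore(max_loaded)
    lock = threading.Lock()
    stats = {"loaded": 0, "peak": 0}  # tracks how many partitions are resident

    def task(pid):
        with gate:  # blocks until one of the max_loaded slots is free
            with lock:
                stats["loaded"] += 1
                stats["peak"] = max(stats["peak"], stats["loaded"])
            try:
                edges = load_partition(pid)
                return process_partition(edges)
            finally:
                with lock:
                    stats["loaded"] -= 1

    with ThreadPoolExecutor(max_workers=max(1, num_partitions)) as pool:
        results = list(pool.map(task, range(num_partitions)))
    return results, stats["peak"]
```

Calling `run(8)` returns one result per partition together with the peak number of partitions that were resident simultaneously, which never exceeds the cap. In CPython the threads mainly illustrate the structure, since the interpreter lock limits CPU parallelism; on the JVM the same pattern would run the per-partition work in parallel.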