cloudml / zen

Zen aims to provide the largest-scale and most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines, and DNNs.
Apache License 2.0

(LDA): scalability issue in updateCounter. #30

Open hucheng opened 9 years ago

hucheng commented 9 years ago

The current updateCounter is implemented via aggregateMessages. In each partition it creates a dense array whose length equals the number of vertices in that partition, where each element is a SparseVector. When the number of vertices in one partition is huge (consider 1B vertices), this array cannot be held in memory. This can be solved by:

  1. A better graph partitioning approach, so that the number of vertices in a partition stays bounded even when the graph (input data) is super large.
  2. Using aggregateByKey. It has several advantages over aggregateMessages: (1) For each edge, aggregateMessages allocates a new SparseVector (the edge attribute is an array), and then another new SparseVector holding the result of SparseVector + SparseVector. (2) The reason to prefer aggregateByKey over reduceByKey is that its seqOp signature (U, V) => U does not require V to have the same type as U, so there is no need to build a SparseVector from the edge attribute array. Moreover, to avoid memory allocation, both seqOp and combOp are allowed to modify and return their first argument instead of creating a new U. (3) The side effect of aggregateByKey is that it needs a sort phase when sort-based shuffle is used.
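A minimal, hypothetical sketch of the seqOp idea in point 2 (plain Scala standing in for Spark's rdd.aggregateByKey; SparseCounter, countByVertex, and the sample topic data are illustrative names, not the project's code): because seqOp has type (U, V) => U, each edge's raw topic array can be folded into one mutable per-vertex counter in place, with no SparseVector allocated per edge.

```scala
import scala.collection.mutable

// A mutable sparse counter standing in for a per-vertex SparseVector of topic counts.
final class SparseCounter {
  private val counts = mutable.HashMap.empty[Int, Int]

  // seqOp-style: fold a raw edge attribute (an array of topic ids) into this
  // counter in place -- no intermediate SparseVector is allocated.
  def add(topics: Array[Int]): SparseCounter = {
    topics.foreach { t => counts(t) = counts.getOrElse(t, 0) + 1 }
    this
  }

  // combOp-style: merge another counter into this one, also in place.
  def merge(other: SparseCounter): SparseCounter = {
    other.counts.foreach { case (t, c) => counts(t) = counts.getOrElse(t, 0) + c }
    this
  }

  def apply(t: Int): Int = counts.getOrElse(t, 0)
}

// Local stand-in for rdd.aggregateByKey(new SparseCounter)(_.add(_), _.merge(_)):
// group edges by destination vertex, then fold each group's topic arrays
// into a single reused counter.
def countByVertex(edges: Seq[(Long, Array[Int])]): Map[Long, SparseCounter] =
  edges.groupBy(_._1).map { case (v, es) =>
    v -> es.foldLeft(new SparseCounter) { (u, e) => u.add(e._2) }
  }

val counters = countByVertex(Seq((1L, Array(0, 2, 2)), (1L, Array(2)), (2L, Array(1))))
println(counters(1L)(2))  // topic 2 was seen 3 times on vertex 1's edges
```

The key point is that add and merge return their (mutated) first argument, which is exactly the reuse that aggregateByKey's contract permits and reduceByKey's (V, V) => V signature would force through extra SparseVector construction.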
bhoppi commented 9 years ago

The 2nd issue is solved now. We use neither aggregateMessages nor aggregateByKey; we do it ourselves. The implementation is like aggregateMessages, so no sorting is needed, and it works in place, so it uses little extra memory and doesn't need to create many new SparseVectors.
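A rough, hypothetical sketch of that scheme (aggregatePartition and the local-vertex-id message encoding are illustrative, not the actual PR code): within one partition, each message's topics are accumulated straight into a mutable per-vertex counter, so there is no sort phase and no fresh sparse vector per message.

```scala
import scala.collection.mutable

// In-place aggregation for one partition: messages are folded directly into
// per-vertex mutable counters (indexed by local vertex id), so nothing is
// sorted and no new sparse vectors are created along the way.
def aggregatePartition(numLocalVerts: Int,
    messages: Iterator[(Int, Array[Int])]): Array[mutable.HashMap[Int, Int]] = {
  val agg = Array.fill(numLocalVerts)(mutable.HashMap.empty[Int, Int])
  messages.foreach { case (localVid, topics) =>
    val counter = agg(localVid)  // reuse the same mutable counter every time
    topics.foreach { t => counter(t) = counter.getOrElse(t, 0) + 1 }
  }
  agg
}

val agg = aggregatePartition(2, Iterator((0, Array(1, 1)), (1, Array(2)), (0, Array(2))))
println(agg(0))  // accumulated topic counts for local vertex 0
```

Only one counter per distinct vertex in the partition is ever alive, which matches the "done in-place so not much memory needed" description above.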

hucheng commented 9 years ago

Great. Likewise, please share the PR here.