cloudml / zen

Zen aims to provide the largest-scale and most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines, and DNN.
Apache License 2.0

(LDA): Unnecessary updates, and thus unnecessary network traffic, happen in updateCounter. #29

Open hucheng opened 9 years ago

hucheng commented 9 years ago

updateCounter: the newly sampled topics on the edges are sent to the word/doc vertices to update the word-topic/doc-topic tables.

updateCounter generates unnecessary updates, and therefore unnecessary network traffic, when an edge has NO topic change (the newly sampled topic is the same as the old topic). In other words, only edges whose topic changed need to update the counters in the vertices.
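To make the cost concrete, here is a minimal plain-Python sketch (not Zen's Scala/GraphX code) that counts how many vertex-bound messages one updateCounter pass would produce with and without skipping unchanged edges. The `Edge` layout and the `messages_sent` helper are hypothetical names introduced only for illustration, and the sketch assumes each reporting edge sends one message to its word vertex and one to its doc vertex.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    """One token: an edge between a word vertex and a doc vertex (hypothetical layout)."""
    word_id: int
    doc_id: int
    old_topic: int
    new_topic: int


def messages_sent(edges: List[Edge], skip_unchanged: bool) -> int:
    """Count vertex-bound messages produced in one counter-update pass.

    Assumption: every edge that reports sends two messages,
    one to its word vertex and one to its doc vertex.
    """
    count = 0
    for e in edges:
        if skip_unchanged and e.old_topic == e.new_topic:
            continue  # topic did not change, so neither table needs touching
        count += 2
    return count


edges = [
    Edge(word_id=0, doc_id=10, old_topic=1, new_topic=1),  # unchanged
    Edge(word_id=0, doc_id=11, old_topic=2, new_topic=3),  # changed
    Edge(word_id=1, doc_id=10, old_topic=0, new_topic=0),  # unchanged
    Edge(word_id=2, doc_id=11, old_topic=3, new_topic=1),  # changed
]

print(messages_sent(edges, skip_unchanged=False))  # 8
print(messages_sent(edges, skip_unchanged=True))   # 4
```

In this toy input half of the tokens keep their topic, so half of the messages disappear. The actual saving depends on how many tokens change topic in a given iteration.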

The solution is to send deltas rather than values. The attribute of an edge (a token) changes from topicId to an (oldTopicId, newTopicId) pair. The aggregated delta is then subtracted from the oldTopicId entry in the word-topic/doc-topic table and added to the newTopicId entry.
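The delta scheme described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Zen's implementation: tokens are modeled as `(word_id, doc_id, old_topic, new_topic)` tuples, the tables are in-memory dictionaries instead of distributed vertex attributes, and `aggregate_deltas`, `apply_deltas`, `count_table`, and `nonzero` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Each token is (word_id, doc_id, old_topic, new_topic); this layout is an assumption.
tokens = [
    (0, 10, 1, 1),  # unchanged, contributes no delta
    (0, 11, 2, 3),
    (1, 10, 0, 0),  # unchanged, contributes no delta
    (2, 11, 3, 1),
]


def aggregate_deltas(tokens):
    """Aggregate per-vertex topic deltas, skipping tokens whose topic is unchanged."""
    word_delta = defaultdict(Counter)
    doc_delta = defaultdict(Counter)
    for word_id, doc_id, old, new in tokens:
        if old == new:
            continue  # nothing to send for this edge
        for delta, vid in ((word_delta, word_id), (doc_delta, doc_id)):
            delta[vid][old] -= 1  # one token leaves the old topic
            delta[vid][new] += 1  # one token joins the new topic
    return word_delta, doc_delta


def apply_deltas(table, deltas):
    """Add the aggregated deltas to a vertex -> topic -> count table in place."""
    for vid, delta in deltas.items():
        for topic, d in delta.items():
            table[vid][topic] += d


def count_table(pairs):
    """Build a vertex -> topic -> count table from (vertex_id, topic) pairs."""
    table = defaultdict(Counter)
    for vid, topic in pairs:
        table[vid][topic] += 1
    return table


def nonzero(table):
    """Drop zero counts so tables can be compared as plain dictionaries."""
    return {vid: {t: c for t, c in row.items() if c != 0} for vid, row in table.items()}


# Tables before sampling, built from the old topics.
word_table = count_table((w, old) for w, _, old, _ in tokens)
doc_table = count_table((d, old) for _, d, old, _ in tokens)

word_delta, doc_delta = aggregate_deltas(tokens)
apply_deltas(word_table, word_delta)
apply_deltas(doc_table, doc_delta)

# The delta-updated tables match tables recounted from the new topics.
assert nonzero(word_table) == nonzero(count_table((w, new) for w, _, _, new in tokens))
assert nonzero(doc_table) == nonzero(count_table((d, new) for _, d, _, new in tokens))
```

Only vertices touched by a changed token appear in the delta maps (words 0 and 2, doc 11 here). Deltas on the same vertex can also cancel during aggregation: doc 11 gains and loses one token of topic 3, so its net delta for that topic is zero.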