cloudml / zen

Zen aims to provide the largest scale and the most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines and DNN.
Apache License 2.0

(LDA): different termination conditions for different vertices. #31

Open hucheng opened 9 years ago

hucheng commented 9 years ago

The insight is that convergence speed varies: the topic assignments of some edges, or the word-topic distributions of some words, converge earlier than others. Edges and words that have already converged do not need to be added to the working set in the next iteration. The question is how to decide whether an edge or word has converged.

A feasible solution is to use the Bhattacharyya coefficient (https://en.wikipedia.org/wiki/Bhattacharyya_distance) to compare a word's word-topic distribution across two consecutive iterations. The more similar the two distributions are, the more likely it is that the word has converged. Rather than simply filtering out converged words with a threshold, we sample the edges of each word with a probability that decreases as the similarity increases. We also take time into account: the longer an edge has gone unsampled, the higher its sampling probability becomes.
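To make the proposal concrete, here is a minimal Python sketch of the idea, not Zen's actual implementation. The Bhattacharyya coefficient follows its standard definition, BC(p, q) = sum over topics of sqrt(p_k * q_k). Everything else is an assumption chosen for illustration: the function names, the `floor` and `recovery` parameters, and the particular formula that combines similarity with the number of iterations an edge has gone unsampled.

```python
import math
import random


def bhattacharyya_coefficient(prev_counts, curr_counts):
    """Bhattacharyya coefficient between two word-topic count vectors.

    Both vectors are normalized to probability distributions first.
    Returns a value in [0, 1]; 1 means the distributions are identical.
    """
    prev_total = sum(prev_counts)
    curr_total = sum(curr_counts)
    if prev_total == 0 or curr_total == 0:
        return 0.0  # no evidence yet, treat as "not converged"
    return sum(
        math.sqrt((a / prev_total) * (b / curr_total))
        for a, b in zip(prev_counts, curr_counts)
    )


def sample_probability(similarity, idle_iterations, floor=0.05, recovery=0.1):
    """Hypothetical sampling probability for one edge of a word.

    The base probability falls as similarity rises (never below `floor`).
    Each iteration the edge has gone unsampled closes a fraction
    `recovery` of the remaining gap to 1, so long-idle edges are
    eventually revisited.
    """
    base = max(floor, 1.0 - similarity)
    return 1.0 - (1.0 - base) * (1.0 - recovery) ** idle_iterations


def select_working_set(edges, similarity, idle, rng):
    """Pick the edges of one word to resample in the next iteration.

    `edges` is a list of edge ids, `idle` maps edge id -> iterations
    since the edge was last sampled.
    """
    return [
        e for e in edges
        if rng.random() < sample_probability(similarity, idle.get(e, 0))
    ]


if __name__ == "__main__":
    prev = [40, 5, 5]   # word-topic counts at iteration t-1
    curr = [42, 4, 4]   # word-topic counts at iteration t
    sim = bhattacharyya_coefficient(prev, curr)
    chosen = select_working_set(
        ["e1", "e2", "e3"], sim, {"e1": 0, "e2": 3, "e3": 20}, random.Random(0)
    )
    print(sim, chosen)
```

With this shape, a word whose distribution barely changed contributes few edges to the working set, while an edge that has been skipped for many iterations regains a sampling probability close to 1. The geometric recovery is only one possible way to model the time factor.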