(LDA)How to set up scale related parameters?

cloudml / zen

Zen aims to provide the largest scale and the most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines and DNN.

Apache License 2.0


(LDA)How to set up scale related parameters? #48

Closed ylqfp closed 8 years ago

ylqfp commented 8 years ago

Hi, I'm testing LDA on a large-scale dataset (a billion docs × a million words). Spark's built-in LDA always hangs on it, which is how I found Zen. However, I haven't found a parameter setup guide for LDA, apart from the brief descriptions in the source code. My questions are:

There are several parameters related to scale. Is there a guide for setting them up?
As for the partition number, how should I choose it for better parallelization? Thanks!

witgo commented 8 years ago

ping @bhoppi

bhoppi commented 8 years ago

I think the parameters related to scale are -numPartitions and -numThreads. There is no general best strategy for all corpora, so you need to experiment. But there are some principles for setting these parameters: for the number of partitions, set it as low as possible while ensuring that blocks stay smaller than 2GB; for the number of threads, set it appropriately, neither too large nor too small. @hucheng Can you share the slides used for Kaiyuanshe about how to set up ZenLDA parameters? I don't have them.
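The partition rule above can be turned into a rough starting point. The sketch below is only an illustration and is not part of Zen: `min_partitions` and `safety_factor` are hypothetical names, the total size must be estimated by you, and it assumes data is spread evenly across partitions, which real corpora often are not.

```python
# Rough lower bound on the partition count so that no block exceeds ~2GB.
# Assumptions (not from Zen): even data distribution, caller-supplied size estimate.

# Spark blocks were historically capped at Int.MaxValue bytes (just under 2 GiB).
MAX_BLOCK_BYTES = 2**31 - 1


def min_partitions(total_bytes: int, safety_factor: float = 0.5) -> int:
    """Return the smallest partition count that keeps each block under
    safety_factor * MAX_BLOCK_BYTES, assuming an even split (hypothetical helper)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    target = int(MAX_BLOCK_BYTES * safety_factor)
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, -(-total_bytes // target))


if __name__ == "__main__":
    # Example: an estimated 4 TiB of graph data with the default 50% headroom.
    print(min_partitions(4 * 2**40))
```

Treat the result as a floor for -numPartitions, then adjust by experiment: skewed partitions may need a smaller `safety_factor`, and fewer, larger partitions remain preferable as long as they stay under the limit.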

hucheng commented 8 years ago

Prepare Zen Jar:
Spark-submit commands:
Commands arguments:
Tips on LDA usage:

ylqfp commented 8 years ago

Thanks So Much!!! Closed. @hucheng