cloudml / zen

Zen aims to provide the largest-scale and most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines, and DNNs.
Apache License 2.0

(LDA) Could you please give more detail about the tested dataset and configuration? #67

Closed razrLeLe closed 7 years ago

razrLeLe commented 7 years ago

I tried with a 10k-80k doc-word dataset and the program worked, but when I tested with a 100k-160k doc-word dataset the program failed with this exception:

Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$40.apply(RDD.scala:1027)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$40.apply(RDD.scala:1027)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1027)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
    at org.apache.spark.rdd.RDD$$anonfun$max$1.apply(RDD.scala:1410)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.max(RDD.scala:1409)
    at com.github.cloudml.zen.ml.clustering.DistributedLDAModel.save(LDAModel.scala:242)
    at com.github.cloudml.zen.ml.clustering.DistributedLDAModel.save(LDAModel.scala:222)
    at com.github.cloudml.zen.examples.ml.LDADriver$.runTraining(LDADriver.scala:139)
    at com.github.cloudml.zen.examples.ml.LDADriver$.main(LDADriver.scala:114)
    at com.github.cloudml.zen.examples.ml.LDADriver.main(LDADriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
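
For context, the "empty collection" error is thrown by Spark itself: RDD.max goes through RDD.reduce, which fails when the RDD has no elements, so the model being saved apparently contains no rows at all. A minimal local sketch reproducing just that Spark behavior (not Zen's code; the object and app names are arbitrary):

    import org.apache.spark.{SparkConf, SparkContext}

    object EmptyMaxRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("empty-max-repro").setMaster("local[1]"))
        // An RDD with no elements...
        val empty = sc.parallelize(Seq.empty[Int])
        // ...so max() (which calls reduce()) throws
        // java.lang.UnsupportedOperationException: empty collection,
        // the same error DistributedLDAModel.save hits above.
        empty.max()
        sc.stop()
      }
    }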

I am wondering whether some specific configuration or hardware is required, so could you please tell me more about the configuration you used when working with billions of documents?

And thanks so much for your awesome work!

bhoppi commented 7 years ago

Is your input corpus in LIBSVM format? And what command arguments did you use?

razrLeLe commented 7 years ago

Yes, I used LIBSVM format, and the submit command is as follows:

    spark-submit \
      --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=hdfs://goyoo/user/yuyue/log \
      --executor-cores 4 --num-executors 21 \
      --driver-memory 4G --executor-memory 4G \
      --master yarn-client \
      --class com.github.cloudml.zen.examples.ml.LDADriver \
      zen-examples-0.3-SNAPSHOT-spark1.6.1.jar \
      -numPartitions=21 -LDAAlgorithm=LightLDA -numThreads=16 -numTopics=500 \
      -alpha=0.01 -beta=0.01 -alphaAS=1.0 -totalIter=1500 \
      hdfs://goyoo/user/yuyue/10w_doc.libsvm hdfs://goyoo/user/yuyue/zen_10w_result

bhoppi commented 7 years ago

  1. What's wrong: -numThreads is the number of threads allocated per partition, so it must be <= --executor-cores, otherwise the job won't start.
  2. May need tuning: I don't know how big your corpus is, but if it is very big, --executor-memory 4G may not be enough, and you may need to increase it if OOM happens.
  3. Other suggestions: -LDAAlgorithm=ZenLDA is the fastest of all the LDA implementations; -chkptinterval=100 (for example) is needed to checkpoint every 100 iterations, otherwise your job will get very slow after hundreds of iterations because driver memory is eaten up by the very long RDD lineage (a minimal sketch of this follows the list).
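
To illustrate point 3, here is a minimal sketch (not Zen's actual code; the checkpoint directory, object name, and loop body are placeholders) of why periodic checkpointing keeps the driver healthy: each iteration extends the RDD lineage, and checkpointing every N iterations truncates it.

    import org.apache.spark.{SparkConf, SparkContext}

    object CheckpointSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
        sc.setCheckpointDir("hdfs:///tmp/zen-chkpt")  // placeholder path

        // Stand-in for the per-iteration topic-count RDD updated by sampling.
        var counts = sc.parallelize(0 until 1000000).map(i => (i.toLong, 1L))
        for (iter <- 1 to 1500) {
          counts = counts.mapValues(_ + 1)  // one "sampling" pass extends the lineage
          if (iter % 100 == 0) {            // mirrors -chkptinterval=100
            counts.checkpoint()             // cut the lineage...
            counts.count()                  // ...materialized on the next action
          }
        }
        sc.stop()
      }
    }
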
razrLeLe commented 7 years ago

@bhoppi Thanks so much for your help. I finally found out that there was something wrong with the preprocessing of the corpus, which made every word count in every document zero, and that caused the exception.
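
For anyone hitting the same thing: a quick sanity check like the sketch below (not part of Zen; the input path comes from the command above and is just an example) would have caught it, by counting documents whose word counts sum to zero before training.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.util.MLUtils

    object CorpusSanityCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("corpus-sanity-check"))
        // Example path; point it at the LIBSVM corpus you intend to train on.
        val docs = MLUtils.loadLibSVMFile(sc, "hdfs://goyoo/user/yuyue/10w_doc.libsvm")
        // Documents whose term counts are all zero contribute nothing to LDA
        // and, as above, can leave the trainer with an effectively empty model.
        val emptyDocs = docs.filter(_.features.toArray.sum == 0.0).count()
        println(s"documents with all-zero word counts: $emptyDocs out of ${docs.count()}")
        sc.stop()
      }
    }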