Angel-ML / angel

A Flexible and Powerful Parameter Server for large-scale machine learning

Can you provide some suggestions for the LDA parameters? #766

Open wqh17101 opened 5 years ago

wqh17101 commented 5 years ago
sh ./angel-submit \
-Daction.type=train \
-Dangel.app.submit.class=com.tencent.angel.ml.lda.LDARunner \
-Dml.model.class.name=com.tencent.angel.ml.lda.LDAModel \
-Dangel.train.data.path="hdfs://jr-hdfs//tmp/wangqinghua/lda/angel_test/data/trainData" \
-Dangel.log.path="hdfs://jr-hdfs//tmp/wangqinghua/lda/angel_test/log" \
-Dangel.save.model.path="hdfs://jr-hdfs//tmp/wangqinghua/lda/angel_test/model" \
-Dsave.doc.topic.distribution=true \
-Dsave.topic.word.distribution=true \
-Dsave.doc.topic=true \
-Dsave.word.topic=true \
-Dml.lda.word.num=33404450 \
-Dml.lda.topic.num=100 \
-Dml.epoch.num=20 \
-Dml.data.type=dummy \
-Dml.feature.index.range=1024 \
-Dangel.job.name=LDAtest \
-Dangel.am.memory.gb=20 \
-Dangel.worker.memory.gb=2 \
-Dangel.ps.memory.gb=2 \
-Dangel.staging.dir="hdfs://jr-hdfs//tmp/wangqinghua/lda/angel_test/stage" \
--queue datamin.default \
-Dangel.output.path.deleteonexist=true \
-Dangel.workergroup.number=20 \
-Dangel.ps.number=20 \
-Dangel.ps.cpu.vcores=15 \
-Dangel.am.cpu.vcores=28 \
-Dangel.am.java.opts=-Xmx8192m \
-Dangel.ps.java.opts=-Xmx8192m

The current parameters are listed above. The total training data is 90 million documents, roughly 1.6 TB. I'd appreciate some suggestions.

wqh17101 commented 5 years ago

Cluster configuration: 5.5 TB of memory, 1800 vCores.

wqh17101 commented 5 years ago

[screenshot attached]

leleyu commented 5 years ago

I think you can use more workers (increase angel.workergroup.number). 1.6 TB of data may require more than 100 workers, each with 10 GB of memory.
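For illustration, the two flags affected by this suggestion might be scaled roughly as follows (a sketch only; the figures of 100 workers and 10 GB each come from the comment above and are not confirmed as final values), so that total worker memory (100 x 10 GB = 1000 GB) is on the same order as the 1.6 TB input:

-Dangel.workergroup.number=100 \
-Dangel.worker.memory.gb=10 \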

wqh17101 commented 5 years ago

It's running successfully now, thanks @leleyu. So the rule of thumb is roughly worker_num * worker_memory ≈ data size, is that right? Do any other parameters need tuning? I'd like the training to run faster.
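As a rough sanity check of that rule of thumb against the numbers in this thread (illustrative arithmetic only, not a confirmed sizing formula):

original:   20 workers x  2 GB =   40 GB   (far below the ~1.6 TB of training data)
suggested: 100 workers x 10 GB = 1000 GB   (on the same order as 1.6 TB)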