Angel-ML / angel

A Flexible and Powerful Parameter Server for large-scale machine learning

Error when predict with LDA #774

Open wqh17101 opened 5 years ago

wqh17101 commented 5 years ago

My script is as follows:

sh ./angel-submit \
-Daction.type=predict \
-Dangel.app.submit.class=com.tencent.angel.ml.lda.LDARunner \
-Dml.model.class.name=com.tencent.angel.ml.lda.LDAModel \
-Dangel.predict.data.path=${dataPath} \
-Dangel.predict.out.path=${outPath} \
-Dangel.log.path=${logPath} \
-Dangel.load.model.path=${modelPath} \
-Dsave.doc.topic.distribution=true \
-Dsave.topic.word.distribution=true \
-Dsave.doc.topic=true \
-Dsave.word.topic=true \
-Dml.lda.word.num=33404450 \
-Dml.lda.topic.num=${topic_number} \
-Dsave.word.topic=true \
-Dml.epoch.num=300 \
-Dml.data.type=dummy \
-Dml.feature.index.range=1024 \
-Dangel.job.name=LDApredict \
-Dangel.am.memory.gb=20 \
-Dangel.worker.memory.gb=10 \
-Dangel.ps.memory.gb=2 \
-Dangel.staging.dir="hdfs://jr-hdfs//tmp/wangqinghua/lda/angel_test/stage" \
--queue datamin.default \
-Dangel.output.path.deleteonexist=true \
-Dangel.workergroup.number=100 \
-Dangel.ps.number=20 \
-Dangel.ps.cpu.vcores=15 \
-Dangel.am.cpu.vcores=28 \
-Dangel.am.java.opts=-Xmx8192m \
-Dangel.ps.java.opts=-Xm8192m
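
One detail in the script above that may be worth a look: the last option passes `-Xm8192m`, while HotSpot's maximum-heap flag is normally written `-Xmx<size>` (as in the `angel.am.java.opts` line). Whether this is related to the failure is not established in this thread. A minimal sketch of a pre-submit check follows; `has_valid_xmx` is a hypothetical helper, not part of Angel.

```shell
#!/bin/sh
# Hypothetical helper (not part of Angel): succeeds only if the given
# java.opts string contains a heap flag of the form -Xmx<number>[k|m|g].
has_valid_xmx() {
    printf '%s\n' "$1" |
        grep -Eq '(^|[[:space:]])-Xmx[0-9]+[kKmMgG]?($|[[:space:]])'
}

# Check the two java.opts values used in the submit script.
for opts in "-Xmx8192m" "-Xm8192m"; do
    if has_valid_xmx "$opts"; then
        echo "$opts: ok"
    else
        echo "$opts: no well-formed -Xmx flag"
    fi
done
```

Running a check like this before calling `angel-submit` would flag the second value without changing how the job is launched.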

At first, the job ran normally:

[image]

Then I could not find the cause of the error, so I checked the log:

[image]

2019-05-20 11:03:06,491 INFO [RMCommunicator Allocator] com.tencent.angel.master.deploy.ContainerAllocator: Received completed container:ContainerStatus: [ContainerId: container_e02_1550912047489_134265_01_000193, State: COMPLETE, Diagnostics: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
, ExitStatus: -105, ]

It looks like a memory problem? I trained on 90 million records without any issue, but for prediction I used only the first 1 million. Why would that fail?
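
As a side note on reading the log: by the usual shell convention, an exit code above 128 means the process was terminated by signal number (code - 128), so 143 corresponds to signal 15 (SIGTERM). That matches the "Container killed by the ApplicationMaster" line and does not by itself show an out-of-memory condition. A small sketch of the arithmetic, where `signal_from_exit_code` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: map a process exit code to the signal number that
# terminated it, following the "128 + signal" convention. Prints 0 when
# the code does not indicate termination by a signal.
signal_from_exit_code() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo 0
    fi
}

# Exit code taken from the container diagnostics in the log above.
echo "exit code 143 -> signal $(signal_from_exit_code 143)"
```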

wqh17101 commented 5 years ago

It fails with 10,000 records as well.

Phoenixinfire commented 4 years ago

Same here.