Open kimi132009 opened 5 years ago
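Could you please post the complete error log? Thanks.
The number of PS instances and the memory should be sufficient. Some of the parameters in the launch command do not need to be set.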
$ANGEL_HOME/bin/angel-submit \
  --action.type train \
  --angel.app.submit.class com.tencent.angel.ml.GBDT.GBDTRunner \
  --angel.train.data.path $input_path \
  --angel.save.model.path $model_path \
  --angel.log.path $log_path \
  --ml.data.type libsvm \
  --ml.feature.index.range $featureNum \
  --ml.gbdt.tree.num 3 \
  --ml.gbdt.tree.depth 3 \
  --ml.gbdt.split.num 10 \
  --ml.data.validate.ratio 0.1 \
  --ml.gbdt.sample.ratio 1 \
  --ml.learn.rate $learnRate \
  --angel.workergroup.number $workerNumber \
  --angel.worker.memory.gb $workerMemory \
  --angel.worker.cpu.vcores $workerCpu \
  --angel.task.data.storage.level $storageLevel \
  --angel.task.memorystorage.max.gb $taskMemory \
  --angel.ps.number $PSNumber \
  --angel.ps.memory.gb $PSMemory \
  --angel.am.memory.gb 4 \
  --angel.staging.dir /user/weibo_bigdata_push/angel_stage \
  --angel.tmp.output.path.prefix /user/weibo_bigdata_push/angel_stage
The job keeps running and I see no exceptions in the PS logs, but the worker_group logs contain:
2019-06-12 11:42:28,616 WARN [Worker Heartbeat] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: Error reading the stream java.io.IOException: No such process
2019-06-12 12:11:28,345 WARN [Worker Heartbeat] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: Error reading the stream java.io.IOException: No such process
The overall application log reports:
2019-06-12 11:00:31,331 WARN [DataStreamer for file /user/weibo_bigdata_push/angel_stage/application_1559141668369_2206463_9c1b4b73-128f-4bbc-bb6a-ad9b1095b2e0/app/_tmp.psmeta_42926427531311035] org.apache.hadoop.hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
2019-06-12 11:17:12,026 WARN [DataStreamer for file /user/weibo_bigdata_push/angel_stage/application_1559141668369_2206463_9c1b4b73-128f-4bbc-bb6a-ad9b1095b2e0/app/_tmp.psmeta_42927428265776684] org.apache.hadoop.hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
But it still keeps retrying. @bluesjjw
After changing the parameters as you suggested, the job now runs further. However, I now get an error where the GradPairs cannot be obtained:
2019-06-12 16:26:02,865 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: **Current phase: NEW_TREE, clock[3]**
2019-06-12 16:26:02,865 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: ------Create new tree------
2019-06-12 16:26:02,868 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: ------Calculate grad pairs------
2019-06-12 16:26:02,869 ERROR [pool-5-thread-1] com.tencent.angel.worker.task.Task: task runner error
java.lang.ArrayIndexOutOfBoundsException: 0
at com.tencent.angel.ml.GBDT.algo.GBDTController.calGradPairs(GBDTController.java:242)
at com.tencent.angel.ml.GBDT.algo.GBDTController.createNewTree(GBDTController.java:462)
at com.tencent.angel.ml.GBDT.GBDTLearner.train(GBDTLearner.scala:132)
at com.tencent.angel.ml.GBDT.GBDTTrainTask.train(GBDTTrainTask.scala:47)
at com.tencent.angel.ml.core.TrainTask.run(TrainTask.scala:50)
at com.tencent.angel.worker.task.Task.runUser(Task.java:92)
at com.tencent.angel.worker.task.Task.run(Task.java:68)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
2019-06-12 16:26:02,914 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: worker failed message : taskid=task_1, state=FAILED, diagnostics=[task runner error: java.lang.ArrayIndexOutOfBoundsException: 0
at com.tencent.angel.ml.GBDT.algo.GBDTController.calGradPairs(GBDTController.java:242)
at com.tencent.angel.ml.GBDT.algo.GBDTController.createNewTree(GBDTController.java:462)
at com.tencent.angel.ml.GBDT.GBDTLearner.train(GBDTLearner.scala:132)
at com.tencent.angel.ml.GBDT.GBDTTrainTask.train(GBDTTrainTask.scala:47)
at com.tencent.angel.ml.core.TrainTask.run(TrainTask.scala:50)
at com.tencent.angel.worker.task.Task.runUser(Task.java:92)
at com.tencent.angel.worker.task.Task.run(Task.java:68)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
], send it to appmaster success
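Do I need to add some parameter here?
It looks like a problem with reading the input data. Can you share the complete log?
OK, one moment. I suspect the same thing. I will first run the sample data you provide, data1.libsvm, to see whether the same bug appears.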
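One way to rule the input in or out before rerunning is a quick format check on the training file. The following is a minimal sketch, not Angel's own parser: the function name `check_libsvm_lines` and the `max_index` parameter are made up for illustration, and the rule that every feature index must lie between 0 and `max_index` is an assumption that should be adjusted to match whatever is passed as `--ml.feature.index.range`.

```python
from typing import Iterable, List, Tuple


def check_libsvm_lines(lines: Iterable[str], max_index: int) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for lines that do not look like libsvm.

    Expected shape of a line: "<label> <index>:<value> <index>:<value> ...".
    max_index is a hypothetical bound chosen by the caller, not an Angel setting.
    """
    problems: List[Tuple[int, str]] = []
    for lineno, raw in enumerate(lines, start=1):
        parts = raw.split()
        if not parts:
            problems.append((lineno, "empty line"))
            continue
        try:
            float(parts[0])  # the label must be numeric
        except ValueError:
            problems.append((lineno, f"bad label {parts[0]!r}"))
            continue
        if len(parts) == 1:
            problems.append((lineno, "no features"))
            continue
        for token in parts[1:]:
            index, _, value = token.partition(":")
            try:
                idx = int(index)
                float(value)  # an empty value (missing ':') also fails here
            except ValueError:
                problems.append((lineno, f"bad feature {token!r}"))
                break
            if idx < 0 or idx > max_index:
                problems.append((lineno, f"index {idx} outside 0..{max_index}"))
                break
    return problems
```

Calling it as `check_libsvm_lines(open(path), max_index=...)` on a local copy of one input split lists the offending line numbers; an empty result only means the file passed these particular checks, not that the data is fine for GBDT.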
19/06/13 11:27:37 INFO utils.AngelRunJar: angelHomePath conf path=/data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//conf/angel-site.xml
19/06/13 11:27:37 INFO utils.AngelRunJar: load system config file success
19/06/13 11:27:38 INFO utils.AngelRunJar: jars loaded: file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/arpack_combined_all-0.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/all-1.1.2.pom,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/core-1.1.2.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-armhf-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-i686-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-x86_64-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-armhf-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-i686-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-x86_64-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-annotations-2.9.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-core-2.9.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-core-asl-1.8.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-databind-2.9.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-jaxrs-1.8.3.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-mapper-asl-1.8.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-module-paranamer-2.9.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-module-scala_2.11-2.9.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-xc-1.8.3.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-ast_2.11-3.6.0.jar,file:///data_n
ew/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-core_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-jackson_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-scalap_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netty-all-4.1.1.Final.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-mllib-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-tools-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/scala-reflect-2.11.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/memory-0.8.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/sketches-core-0.8.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/commons-pool-1.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/kryo-shaded-4.0.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/kryo-serializers-0.42.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/scala-library-2.11.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-core-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-psf-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/fastutil-7.1.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/sizeof-0.3.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/minlog-1.3.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/breeze_2.11-0.13.jar
19/06/13 11:27:38 INFO utils.AngelRunJar: angel python file: null
19/06/13 11:27:38 INFO utils.UGITools: UGI_PROPERTY_NAME is null
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data_new/weichao/bin_round/angel/angel-2.0.0-bin/lib/slf4j-log4j12-1.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data_new/weichao/bin_round/angel/angel-2.0.0-bin/lib/spark-on-angel-mllib-2.0.0-alpha-dep.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/06/13 11:27:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/06/13 11:27:38 INFO utils.AngelRunJar: submitClass: com.tencent.angel.ml.GBDT.GBDTRunner
19/06/13 11:27:38 INFO model.PSModel: After training matrix gbdt.node.predict will be saved to /user/weibo_bigdata_push/weichao/angel/fm/ori_model_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190613_1127
19/06/13 11:27:38 INFO model.PSModel: After training matrix gbdt.split.value will be saved to /user/weibo_bigdata_push/weichao/angel/fm/ori_model_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190613_1127
19/06/13 11:27:38 INFO model.PSModel: After training matrix gbdt.split.feature will be saved to /user/weibo_bigdata_push/weichao/angel/fm/ori_model_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190613_1127
19/06/13 11:27:38 INFO utils.UGITools: UGI_PROPERTY_NAME is null
19/06/13 11:27:40 INFO client.AngelClient: running mode = ANGEL_PS_WORKER
19/06/13 11:27:40 INFO utils.HdfsUtil: tmp output dir is hdfs:///user/weibo_bigdata_push/angel_stage/application_1559141668369_2580449_66ff9bd3-37ea-418f-ab5f-efbf2fece6b1
19/06/13 11:27:40 INFO utils.HdfsUtil: tmp output dir is hdfs:///user/weibo_bigdata_push/angel_stage/application_1559141668369_2580449_f143c0aa-dc06-4bbf-a9ef-94b42115f165
19/06/13 11:27:40 INFO client.AngelClient: angel.tmp.output.path=hdfs:/user/weibo_bigdata_push/angel_stage/application_1559141668369_2580449_66ff9bd3-37ea-418f-ab5f-efbf2fece6b1
19/06/13 11:27:40 INFO client.AngelClient: internal state file is hdfs:/user/weibo_bigdata_push/angel_stage/application_1559141668369_2580449_f143c0aa-dc06-4bbf-a9ef-94b42115f165/state
19/06/13 11:27:40 INFO yarn.AngelYarnClient: default FileSystem: hdfs://ns3-backup
19/06/13 11:27:40 INFO yarn.AngelYarnClient: libjarsDir=/user/weibo_bigdata_push/angel_stage/weibo_bigdata_push/.staging/application_1559141668369_2580449/libjars
19/06/13 11:27:40 INFO yarn.AngelYarnClient: libjars=file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/arpack_combined_all-0.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/all-1.1.2.pom,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/core-1.1.2.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-armhf-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-i686-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-x86_64-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-armhf-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-i686-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-x86_64-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-annotations-2.9.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-core-2.9.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-core-asl-1.8.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-databind-2.9.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-jaxrs-1.8.3.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-mapper-asl-1.8.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-module-paranamer-2.9.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-module-scala_2.11-2.9.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-xc-1.8.3.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-ast_2.11-3.6.0.jar,file:///data_new
/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-core_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-jackson_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-scalap_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netty-all-4.1.1.Final.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-mllib-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-tools-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/scala-reflect-2.11.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/memory-0.8.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/sketches-core-0.8.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/commons-pool-1.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/kryo-shaded-4.0.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/kryo-serializers-0.42.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/scala-library-2.11.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-core-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-psf-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/fastutil-7.1.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/sizeof-0.3.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/minlog-1.3.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/breeze_2.11-0.13.jar
AppMaster capability = <memory:4096, vCores:1, gcores:0>
19/06/13 11:27:48 INFO yarn.AngelYarnClient: Command to launch container for ApplicationMaster is : $JAVA_HOME/bin/java -Dlog4j.configuration=log/angel.properties -Dlog4j.logger.com.tencent.ml=DEBUG -Dyarn.app.container.log.dir=
19/06/13 11:36:00 INFO client.AngelClient: stop the application
19/06/13 11:36:00 INFO client.AngelClient: master is not null, send stop command to Master, stateCode=0
19/06/13 11:36:00 FATAL utils.AngelRunJar: submit job failed
com.tencent.angel.exception.AngelException: app run failed, detail is killed and failed workergroup is over tolerate 0.0
There are some Workers failed
failed workergroups: WorkerGroup_3.
Worker_3_3 failed.
WorkerAttempt_3_3_0 failed due to: taskid=task_3, state=FAILED, diagnostics=[task runner error: java.lang.ArrayIndexOutOfBoundsException: 0
at com.tencent.angel.ml.GBDT.algo.GBDTController.calGradPairs(GBDTController.java:242)
at com.tencent.angel.ml.GBDT.algo.GBDTController.createNewTree(GBDTController.java:462)
at com.tencent.angel.ml.GBDT.GBDTLearner.train(GBDTLearner.scala:132)
at com.tencent.angel.ml.GBDT.GBDTTrainTask.train(GBDTTrainTask.scala:47)
at com.tencent.angel.ml.core.TrainTask.run(TrainTask.scala:50)
at com.tencent.angel.worker.task.Task.runUser(Task.java:92)
at com.tencent.angel.worker.task.Task.run(Task.java:68)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
]
Detail Worker Log URL:http://10.60.17.50:9042/node/containerlogs/container_e18_1559141668369_2580449_01_000007/angel/syslog/?start=0
WorkerAttempt_3_3_1 failed due to: taskid=task_3, state=FAILED, diagnostics=[task runner error: java.lang.ArrayIndexOutOfBoundsException: 0
at com.tencent.angel.ml.GBDT.algo.GBDTController.calGradPairs(GBDTController.java:242)
at com.tencent.angel.ml.GBDT.algo.GBDTController.createNewTree(GBDTController.java:462)
at com.tencent.angel.ml.GBDT.GBDTLearner.train(GBDTLearner.scala:132)
at com.tencent.angel.ml.GBDT.GBDTTrainTask.train(GBDTTrainTask.scala:47)
at com.tencent.angel.ml.core.TrainTask.run(TrainTask.scala:50)
at com.tencent.angel.worker.task.Task.runUser(Task.java:92)
at com.tencent.angel.worker.task.Task.run(Task.java:68)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
]
Detail Worker Log URL:http://10.39.67.165:9042/node/containerlogs/container_e18_1559141668369_2580449_01_000009/angel/syslog/?start=0
WorkerAttempt_3_3_2 failed due to: taskid=task_3, state=FAILED, diagnostics=[task runner error: java.lang.ArrayIndexOutOfBoundsException: 0
at com.tencent.angel.ml.GBDT.algo.GBDTController.calGradPairs(GBDTController.java:242)
at com.tencent.angel.ml.GBDT.algo.GBDTController.createNewTree(GBDTController.java:462)
at com.tencent.angel.ml.GBDT.GBDTLearner.train(GBDTLearner.scala:132)
at com.tencent.angel.ml.GBDT.GBDTTrainTask.train(GBDTTrainTask.scala:47)
at com.tencent.angel.ml.core.TrainTask.run(TrainTask.scala:50)
at com.tencent.angel.worker.task.Task.runUser(Task.java:92)
at com.tencent.angel.worker.task.Task.run(Task.java:68)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
]
Detail Worker Log URL:http://10.60.15.76:9042/node/containerlogs/container_e18_1559141668369_2580449_01_000021/angel/syslog/?start=0
WorkerAttempt_3_3_3 failed due to: taskid=task_3, state=FAILED, diagnostics=[task runner error: java.lang.ArrayIndexOutOfBoundsException: 0
at com.tencent.angel.ml.GBDT.algo.GBDTController.calGradPairs(GBDTController.java:242)
at com.tencent.angel.ml.GBDT.algo.GBDTController.createNewTree(GBDTController.java:462)
at com.tencent.angel.ml.GBDT.GBDTLearner.train(GBDTLearner.scala:132)
at com.tencent.angel.ml.GBDT.GBDTTrainTask.train(GBDTTrainTask.scala:47)
at com.tencent.angel.ml.core.TrainTask.run(TrainTask.scala:50)
at com.tencent.angel.worker.task.Task.runUser(Task.java:92)
at com.tencent.angel.worker.task.Task.run(Task.java:68)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
]
Detail Worker Log URL:http://10.60.17.34:9042/node/containerlogs/container_e18_1559141668369_2580449_01_000024/angel/syslog/?start=0
at com.tencent.angel.client.AngelClient.waitForCompletion(AngelClient.java:390)
at com.tencent.angel.ml.GBDT.GBDTRunner.train(GBDTRunner.scala:46)
at com.tencent.angel.ml.core.MLRunner$class.submit(MLRunner.scala:88)
at com.tencent.angel.ml.GBDT.GBDTRunner.submit(GBDTRunner.scala:26)
at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:69)
at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:56)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1700)
at com.tencent.angel.utils.AngelRunJar.submit(AngelRunJar.java:56)
at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:40)
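Because the client output above repeats the same stack trace for every worker attempt, a short script can condense it to one row per attempt. This is only a sketch: the regular expression is written against the log text shown in this thread, the names `ATTEMPT_RE` and `summarize_failures` are made up for illustration, and other Angel versions may format the message differently.

```python
import re
from typing import Dict, List

# Pattern derived from the client log in this thread (an assumption, not a
# documented Angel format): attempt name, exception, top frame, log URL.
ATTEMPT_RE = re.compile(
    r"(?P<attempt>WorkerAttempt_\w+) failed due to: .*?"
    r"task runner error: (?P<error>[\w.$]+(?:: .*?)?)\s+"
    r"at (?P<frame>[\w.$]+)\(.*?"
    r"Detail Worker Log URL:(?P<url>\S+)",
    re.DOTALL,
)


def summarize_failures(log_text: str) -> List[Dict[str, str]]:
    """Return one dict per failed worker attempt found in the client log."""
    return [match.groupdict() for match in ATTEMPT_RE.finditer(log_text)]
```

Passing the saved client output to `summarize_failures` and printing the `attempt`, `error`, `frame` and `url` fields makes it easy to see whether every attempt failed at the same frame and which container log to open.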
/user/weibo_bigdata_push/weichao/angel/fm/train_log_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190613_1127
/user/weibo_bigdata_push/weichao/angel/fm/ori_model_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190613_1127
19/06/13 11:36:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
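I thought it was a data input problem, so I switched to several other datasets, including the data that ships with Angel. But then a new bug appears, with an abnormal-string exception.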
19/06/13 19:34:24 INFO utils.AngelRunJar: angelHomePath conf path=/data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//conf/angel-site.xml
19/06/13 19:34:24 INFO utils.AngelRunJar: load system config file success
19/06/13 19:34:24 INFO utils.AngelRunJar: jars loaded: file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/arpack_combined_all-0.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/all-1.1.2.pom,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/core-1.1.2.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-armhf-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-i686-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-x86_64-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-armhf-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-i686-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-x86_64-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-annotations-2.9.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-core-2.9.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-core-asl-1.8.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-databind-2.9.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-jaxrs-1.8.3.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-mapper-asl-1.8.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-module-paranamer-2.9.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-module-scala_2.11-2.9.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-xc-1.8.3.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-ast_2.11-3.6.0.jar,file:///data_n
ew/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-core_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-jackson_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-scalap_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netty-all-4.1.1.Final.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-mllib-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-tools-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/scala-reflect-2.11.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/memory-0.8.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/sketches-core-0.8.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/commons-pool-1.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/kryo-shaded-4.0.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/kryo-serializers-0.42.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/scala-library-2.11.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-core-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-psf-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/fastutil-7.1.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/sizeof-0.3.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/minlog-1.3.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/breeze_2.11-0.13.jar
19/06/13 19:34:24 INFO utils.AngelRunJar: angel python file: null
19/06/13 19:34:24 INFO utils.UGITools: UGI_PROPERTY_NAME is null
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data_new/weichao/bin_round/angel/angel-2.0.0-bin/lib/slf4j-log4j12-1.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data_new/weichao/bin_round/angel/angel-2.0.0-bin/lib/spark-on-angel-mllib-2.0.0-alpha-dep.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/06/13 19:34:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/06/13 19:34:24 INFO utils.AngelRunJar: submitClass: com.tencent.angel.ml.GBDT.GBDTRunner
19/06/13 19:34:25 INFO model.PSModel: After training matrix gbdt.node.predict will be saved to /user/weibo_bigdata_push/weichao/angel/fm/ori_model_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190613_1934
19/06/13 19:34:25 INFO model.PSModel: After training matrix gbdt.split.value will be saved to /user/weibo_bigdata_push/weichao/angel/fm/ori_model_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190613_1934
19/06/13 19:34:25 INFO model.PSModel: After training matrix gbdt.split.feature will be saved to /user/weibo_bigdata_push/weichao/angel/fm/ori_model_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190613_1934
19/06/13 19:34:25 INFO utils.UGITools: UGI_PROPERTY_NAME is null
19/06/13 19:34:26 INFO client.AngelClient: running mode = ANGEL_PS_WORKER
19/06/13 19:34:26 INFO utils.HdfsUtil: tmp output dir is hdfs:///user/weibo_bigdata_push/angel_stage/application_1559141668369_2639119_313083dd-bf06-4896-bfed-10f01ffff717
19/06/13 19:34:26 INFO utils.HdfsUtil: tmp output dir is hdfs:///user/weibo_bigdata_push/angel_stage/application_1559141668369_2639119_90eb86aa-2784-4067-9ed9-989067866c8b
19/06/13 19:34:26 INFO client.AngelClient: angel.tmp.output.path=hdfs:/user/weibo_bigdata_push/angel_stage/application_1559141668369_2639119_313083dd-bf06-4896-bfed-10f01ffff717
19/06/13 19:34:26 INFO client.AngelClient: internal state file is hdfs:/user/weibo_bigdata_push/angel_stage/application_1559141668369_2639119_90eb86aa-2784-4067-9ed9-989067866c8b/state
19/06/13 19:34:26 INFO yarn.AngelYarnClient: default FileSystem: hdfs://ns3-backup
19/06/13 19:34:26 INFO yarn.AngelYarnClient: libjarsDir=/user/weibo_bigdata_push/angel_stage/weibo_bigdata_push/.staging/application_1559141668369_2639119/libjars
19/06/13 19:34:26 INFO yarn.AngelYarnClient: libjars=file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/arpack_combined_all-0.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/all-1.1.2.pom,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/core-1.1.2.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-armhf-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-i686-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_ref-linux-x86_64-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-armhf-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-i686-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netlib-native_system-linux-x86_64-1.1-natives.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-annotations-2.9.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-core-2.9.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-core-asl-1.8.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-databind-2.9.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-jaxrs-1.8.3.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-mapper-asl-1.8.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-module-paranamer-2.9.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-module-scala_2.11-2.9.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/jackson-xc-1.8.3.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-ast_2.11-3.6.0.jar,file:///data_new
/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-core_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-jackson_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/json4s-scalap_2.11-3.6.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/netty-all-4.1.1.Final.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-mllib-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-tools-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/scala-reflect-2.11.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/memory-0.8.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/sketches-core-0.8.1.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/commons-pool-1.6.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/kryo-shaded-4.0.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/kryo-serializers-0.42.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/scala-library-2.11.8.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-core-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/angel-ps-psf-2.0.0-alpha.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/fastutil-7.1.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/sizeof-0.3.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/minlog-1.3.0.jar,file:///data_new/weichao/bin_round/angel/angel-2.0.0-bin/bin/..//lib/breeze_2.11-0.13.jar
AppMaster capability = <memory:4096, vCores:1, gcores:0>
19/06/13 19:34:37 INFO yarn.AngelYarnClient: Command to launch container for ApplicationMaster is : $JAVA_HOME/bin/java -Dlog4j.configuration=log/angel.properties -Dlog4j.logger.com.tencent.ml=DEBUG -Dyarn.app.container.log.dir=
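These two exceptions alternate between runs of the same script on the same data. I tried several datasets, so I think the data itself can be ruled out. If this is a data-transfer problem, what should I change? I tried adding --ml.model.type T T_DOUBLE_DENSE, but it still fails.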
@bluesjjw
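Which version are you running? Is the data in libsvm format?
2.0.0. The data is standard libsvm; I also ran FM on the same data and it worked fine.
It doesn't look like a data problem either. Do you have the worker logs?
Also, I recommend using the current latest version, 2.2.0.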
2019-06-14 14:09:37,220 INFO [main] com.tencent.angel.psagent.PSAgent: PSAgent get matrices from master,16 2019-06-14 14:09:37,275 INFO [main] com.tencent.angel.psagent.matrix.transport.MatrixTransportClient: Use nio channel 2019-06-14 14:09:37,286 INFO [main] com.tencent.angel.psagent.matrix.transport.MatrixTransportClient: ByteOrder.nativeOrder()=LITTLE_ENDIAN 2019-06-14 14:09:37,311 INFO [main] com.tencent.angel.worker.Worker: Init data block manager 2019-06-14 14:09:37,311 INFO [main] com.tencent.angel.worker.Worker: Init and start worker rpc server 2019-06-14 14:09:37,336 INFO [main] com.tencent.angel.worker.WorkerService: Starting workerserver service at 10.39.70.137:20254 2019-06-14 14:09:37,339 INFO [main] com.tencent.angel.worker.Worker: Init counter updater 2019-06-14 14:09:37,431 INFO [main] com.tencent.angel.psagent.CounterUpdater: Using ResourceCalculatorProcessTree : [ 27795 27922 ] 2019-06-14 14:09:37,431 INFO [main] com.tencent.angel.worker.Worker: Register to master and start the heartbeat thread 2019-06-14 14:09:37,432 INFO [main] com.tencent.angel.worker.Worker: Get data splits from master 2019-06-14 14:09:37,432 INFO [Worker Heartbeat] com.tencent.angel.worker.Worker: Register to master 2019-06-14 14:09:37,452 INFO [Worker Heartbeat] com.tencent.angel.worker.Worker: worker register finished! 
2019-06-14 14:09:37,581 INFO [main] com.tencent.angel.worker.Worker: Init and start task manager and all task 2019-06-14 14:09:37,582 INFO [main] com.tencent.angel.worker.task.TaskManager: start all tasks 2019-06-14 14:09:37,582 INFO [main] com.tencent.angel.worker.task.TaskManager: start task task_27 with context=TaskContext [taskId=task_27, taskIdProto=taskIndex: 27 , context=com.tencent.angel.psagent.task.TaskContext@4a3631f8TaskContext [index=27, matrix clocks=(matrixId=0,clock=2)(matrixId=1,clock=2)(matrixId=2,clock=2)(matrixId=3,clock=2)(matrixId=4,clock=2)(matrixId=5,clock=2)(matrixId=6,clock=2)(matrixId=7,clock=2)(matrixId=8,clock=2)(matrixId=9,clock=2)(matrixId=10,clock=2)(matrixId=11,clock=2)(matrixId=12,clock=2)(matrixId=13,clock=2)(matrixId=14,clock=2)(matrixId=15,clock=2)]] 2019-06-14 14:09:37,583 INFO [pool-5-thread-1] com.tencent.angel.worker.task.Task: task task_27 is running. 2019-06-14 14:09:37,593 INFO [pool-5-thread-1] com.tencent.angel.worker.task.Task: userTaskClass = class com.tencent.angel.ml.GBDT.GBDTTrainTask task index = 27, name = Thread-22 2019-06-14 14:09:37,726 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: train.error does not have default value! 2019-06-14 14:09:37,726 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: log.likelihood does not have default value! 2019-06-14 14:09:37,726 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: validate.loss does not have default value! 2019-06-14 14:09:37,727 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: train.loss does not have default value! 2019-06-14 14:09:37,727 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: validate.error does not have default value! 2019-06-14 14:09:37,728 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: ml.gbdt.max.node.num does not have default value! 
2019-06-14 14:09:37,730 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: inctrain does not have default value! 2019-06-14 14:09:37,730 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: predict does not have default value! 2019-06-14 14:09:37,731 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: train does not have default value! 2019-06-14 14:09:37,732 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel-default.xml does not have default value! 2019-06-14 14:09:37,733 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel-site.xml does not have default value! 2019-06-14 14:09:37,733 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel. does not have default value! 2019-06-14 14:09:37,733 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.am. does not have default value! 2019-06-14 14:09:37,733 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.worker. does not have default value! 2019-06-14 14:09:37,733 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.ps. does not have default value! 2019-06-14 14:09:37,733 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.task. does not have default value! 2019-06-14 14:09:37,733 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.workergroup. does not have default value! 2019-06-14 14:09:37,734 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.train.data.path does not have default value! 2019-06-14 14:09:37,734 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.predict.data.path does not have default value! 2019-06-14 14:09:37,734 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.job.input.path does not have default value! 
2019-06-14 14:09:37,734 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.predict.out.path does not have default value! 2019-06-14 14:09:37,734 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.temp.path does not have default value! 2019-06-14 14:09:37,734 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.client.type does not have default value! 2019-06-14 14:09:37,734 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.save.model.path does not have default value! 2019-06-14 14:09:37,742 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.log.path does not have default value! 2019-06-14 14:09:37,742 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.load.model.path does not have default value! 2019-06-14 14:09:37,742 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.job.jar does not have default value! 2019-06-14 14:09:37,743 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.ml.conf does not have default value! 2019-06-14 14:09:37,743 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.job.libjars does not have default value! 2019-06-14 14:09:37,743 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: queue does not have default value! 2019-06-14 14:09:37,743 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.app.config.file does not have default value! 2019-06-14 14:09:37,743 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.job.cache.archives does not have default value! 2019-06-14 14:09:37,743 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.job.cache.files does not have default value! 2019-06-14 14:09:37,743 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.job.complete.cancel.delegation.tokens does not have default value! 
2019-06-14 14:09:37,744 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.job.submit.host does not have default value! 2019-06-14 14:09:37,744 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.job.submit.host.address does not have default value! 2019-06-14 14:09:37,744 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.submit.user.name does not have default value! 2019-06-14 14:09:37,744 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.job.dir does not have default value! 2019-06-14 14:09:37,744 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.app.user.resource.files does not have default value! 2019-06-14 14:09:37,744 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.app.serilize.state.file does not have default value! 2019-06-14 14:09:37,744 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.output.path does not have default value! 2019-06-14 14:09:37,744 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.tmp.output.path.prefix does not have default value! 2019-06-14 14:09:37,744 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.tmp.output.path does not have default value! 2019-06-14 14:09:37,745 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: ANGEL does not have default value! 2019-06-14 14:09:37,745 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.jobid does not have default value! 2019-06-14 14:09:37,745 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: job.xml does not have default value! 2019-06-14 14:09:37,745 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.cluster.local.dir does not have default value! 2019-06-14 14:09:37,748 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.workergroup.actual.number does not have default value! 
2019-06-14 14:09:37,748 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.worker.env does not have default value! 2019-06-14 14:09:37,748 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.worker.java.opts does not have default value! 2019-06-14 14:09:37,749 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.workergroup.failed.tolerate does not have default value! 2019-06-14 14:09:37,749 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.worker.max-attempts does not have default value! 2019-06-14 14:09:37,749 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.task.actual.number does not have default value! 2019-06-14 14:09:37,750 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.ps.backup.matrices does not have default value! 2019-06-14 14:09:37,750 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.ps.max-attempts does not have default value! 2019-06-14 14:09:37,750 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.ps.child.opts does not have default value! 2019-06-14 14:09:37,755 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.matrixtransfer.max.requestnum does not have default value! 2019-06-14 14:09:37,755 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.psagent. does not have default value! 2019-06-14 14:09:37,755 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.ps.ip.list does not have default value! 2019-06-14 14:09:37,756 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.psagent.java.opts does not have default value! 2019-06-14 14:09:37,756 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.psagent.iplist does not have default value! 2019-06-14 14:09:37,756 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.model.parse.name does not have default value! 
2019-06-14 14:09:37,756 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.parse.model.path does not have default value! 2019-06-14 14:09:37,756 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: ml.connection.timeout does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: netty.server.io.threads does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: netty.io.mode does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: netty.client.io.threads does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: ml.rpc.timeout does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.app.type does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.pyangel.python does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.pyangel.pyfile does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.pyangel.pyfile.dependencies does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.plugin.service.enable does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.sharding.num does not have default value! 2019-06-14 14:09:37,757 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.sharding.concurrent.capacity does not have default value! 
2019-06-14 14:09:37,758 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.sharding.model.class does not have default value! 2019-06-14 14:09:37,758 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.master.ip does not have default value! 2019-06-14 14:09:37,758 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.master.port does not have default value! 2019-06-14 14:09:37,758 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.model.name does not have default value! 2019-06-14 14:09:37,758 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.model.load.timeout.minute does not have default value! 2019-06-14 14:09:37,758 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.model.load.check.inteval.second does not have default value! 2019-06-14 14:09:37,758 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.model.load.type does not have default value! 2019-06-14 14:09:37,758 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: angel.serving.predict.local.output does not have default value! 2019-06-14 14:09:37,758 INFO [pool-5-thread-1] com.tencent.angel.ml.core.conf.SharedConf: [I@14c274ed does not have default value! 
2019-06-14 14:09:37,813 INFO [pool-5-thread-1] com.tencent.angel.ml.model.PSModel: After training matrix gbdt.node.predict will be saved to /user/weibo_bigdata_push/weichao/angel/fm/ori_model_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190614_1406 2019-06-14 14:09:37,813 INFO [pool-5-thread-1] com.tencent.angel.ml.model.PSModel: After training matrix gbdt.split.value will be saved to /user/weibo_bigdata_push/weichao/angel/fm/ori_model_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190614_1406 2019-06-14 14:09:37,813 INFO [pool-5-thread-1] com.tencent.angel.ml.model.PSModel: After training matrix gbdt.split.feature will be saved to /user/weibo_bigdata_push/weichao/angel/fm/ori_model_lr0.0005_rank8_reg00.001_reg10.001_reg20.001_20190614_1406 2019-06-14 14:09:37,813 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: 1. initialize 2019-06-14 14:09:37,813 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: Create data meta, numFeature=110, nonzero=-1 2019-06-14 14:09:37,815 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: Finish creating data meta, numRow=0, numCol=110, nonzero=-1 2019-06-14 14:09:37,815 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: Create data meta, numFeature=110, nonzero=-1 2019-06-14 14:09:37,815 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: Finish creating data meta, numRow=0, numCol=110, nonzero=-1 2019-06-14 14:09:37,816 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: Build data info cost 3 ms 2019-06-14 14:09:37,816 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: 2.train 2019-06-14 14:09:37,839 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: **Current phase: CREATE_SKETCH, clock[0]** 2019-06-14 14:09:38,054 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: clock and flush matrices [gbdt.feature.category, gbdt.sketch] cost 214 ms 2019-06-14 14:09:38,192 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: 
**Current phase: GET_SKETCH, clock[1]** 2019-06-14 14:09:38,192 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: ------Get sketch from PS------ 2019-06-14 14:09:38,193 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 3 2019-06-14 14:09:51,836 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 3 over 2019-06-14 14:09:51,881 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Get sketch cost: 13689 ms 2019-06-14 14:09:51,882 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Number of splits of categorical features: [] 2019-06-14 14:09:51,883 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: **Current phase: SAMPLE_FEATURE, clock[2]** 2019-06-14 14:09:51,883 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: ------Sample feature------ 2019-06-14 14:09:51,907 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: clock and flush matrices [] cost 24 ms 2019-06-14 14:09:51,909 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: **Current phase: NEW_TREE, clock[3]** 2019-06-14 14:09:51,909 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: ------Create new tree------ 2019-06-14 14:09:51,912 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: ------Calculate grad pairs------ 2019-06-14 14:09:51,913 ERROR [pool-5-thread-1] com.tencent.angel.worker.task.Task: task runner error java.lang.ArrayIndexOutOfBoundsException: 0 at com.tencent.angel.ml.GBDT.algo.GBDTController.calGradPairs(GBDTController.java:242) at com.tencent.angel.ml.GBDT.algo.GBDTController.createNewTree(GBDTController.java:462) at com.tencent.angel.ml.GBDT.GBDTLearner.train(GBDTLearner.scala:132) at com.tencent.angel.ml.GBDT.GBDTTrainTask.train(GBDTTrainTask.scala:47) at com.tencent.angel.ml.core.TrainTask.run(TrainTask.scala:50) at 
com.tencent.angel.worker.task.Task.runUser(Task.java:92) at com.tencent.angel.worker.task.Task.run(Task.java:68) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) 2019-06-14 14:09:51,928 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: worker failed message : taskid=task_27, state=FAILED, diagnostics=[task runner error: java.lang.ArrayIndexOutOfBoundsException: 0 at com.tencent.angel.ml.GBDT.algo.GBDTController.calGradPairs(GBDTController.java:242) at com.tencent.angel.ml.GBDT.algo.GBDTController.createNewTree(GBDTController.java:462) at com.tencent.angel.ml.GBDT.GBDTLearner.train(GBDTLearner.scala:132) at com.tencent.angel.ml.GBDT.GBDTTrainTask.train(GBDTTrainTask.scala:47) at com.tencent.angel.ml.core.TrainTask.run(TrainTask.scala:50) at com.tencent.angel.worker.task.Task.runUser(Task.java:92) at com.tencent.angel.worker.task.Task.run(Task.java:68) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) ], send it to appmaster success 2019-06-14 14:09:51,928 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: start to close all modules in worker 2019-06-14 14:09:51,928 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: stop workerService 2019-06-14 14:09:51,928 INFO [pool-5-thread-1] com.tencent.angel.worker.WorkerService: stop rpc server 2019-06-14 14:09:51,929 INFO [pool-5-thread-1] com.tencent.angel.ipc.NettyServer: Stopping server on 20254 2019-06-14 14:09:51,936 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: stop psagent 2019-06-14 14:09:51,937 INFO [pool-5-thread-1] com.tencent.angel.psagent.PSAgent: stop heartbeat thread! 
2019-06-14 14:09:51,937 INFO [pool-5-thread-1] com.tencent.angel.psagent.PSAgent: stop op log merger 2019-06-14 14:09:51,940 WARN [oplog-merge-dispatcher] com.tencent.angel.psagent.matrix.oplog.cache.MatrixOpLogCache: oplog-merge-dispatcher interrupted 2019-06-14 14:09:51,940 INFO [pool-5-thread-1] com.tencent.angel.psagent.PSAgent: stop clock cache 2019-06-14 14:09:51,940 INFO [pool-5-thread-1] com.tencent.angel.psagent.PSAgent: stop matrix cache 2019-06-14 14:09:51,940 INFO [clock-syncer] com.tencent.angel.psagent.clock.ClockCache: sync thread is interrupted 2019-06-14 14:09:51,940 INFO [pool-5-thread-1] com.tencent.angel.psagent.PSAgent: stop user request adapater 2019-06-14 14:09:51,941 INFO [pool-5-thread-1] com.tencent.angel.psagent.PSAgent: stop rpc dispacher 2019-06-14 14:09:51,941 INFO [pool-5-thread-1] com.tencent.angel.common.transport.ChannelManager2: Channel manager stop 2019-06-14 14:09:51,951 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: stop heartbeat thread 2019-06-14 14:09:51,951 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: stop taskmanager 2019-06-14 14:09:51,951 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: stop data block manager End of LogType:syslog
Container: container_e18_1559141668369_2794446_01_000034 on yz70178.hadoop.data.sina.com.cn_45454
This is what it looks like.
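numRow=0, numCol=110, nonzero=-1: the data was not read in correctly.
Check whether the parameters are being passed in correctly. If that doesn't help, follow the parameter format in the documentation: https://github.com/Angel-ML/angel/blob/master/docs/algo/gbdt_on_angel.md
I recommend the latest GBDT algorithm, which is developed on Spark: https://github.com/Angel-ML/angel/blob/master/docs/algo/sona/feature_gbdt_sona.md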
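As a rough aid for spotting this symptom, the sketch below scans a worker log for the `Finish creating data meta` line quoted above and flags a worker whose reported `numRow` is zero. The helper names (`data_meta_rows`, `worker_has_no_data`) are made up for illustration and are not part of Angel; the regular expression assumes the log line keeps the format shown in this thread.

```python
import re

# Matches the GBDTLearner line as it appears in the worker log in this thread.
DATA_META = re.compile(r"Finish creating data meta, numRow=(\d+), numCol=(\d+)")

def data_meta_rows(log_text):
    """Return the (numRow, numCol) pairs reported in a worker log."""
    return [(int(rows), int(cols)) for rows, cols in DATA_META.findall(log_text)]

def worker_has_no_data(log_text):
    """True if the log has data-meta lines and all of them report zero rows."""
    metas = data_meta_rows(log_text)
    return bool(metas) and all(rows == 0 for rows, _ in metas)

# Sample line copied from the worker log above.
sample = (
    "2019-06-14 14:09:37,815 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: "
    "Finish creating data meta, numRow=0, numCol=110, nonzero=-1"
)
print(worker_has_no_data(sample))
```

Running this over the aggregated logs of each container would show which workers started training without any rows.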
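Found the cause: the number of workers was set too high, so all the data ended up on the first worker and the other workers, having no data, failed. One question: how much data does each worker handle? I'd like that as a reference.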
2019-06-14 16:15:32,565 INFO [pool-5-thread-1] com.tencent.angel.worker.storage.DiskDataBlock: create diskstorage, base=usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/hdfsdatacache/a2da6475-9cf5-4477-ba14-1f6dbe44e0fa 2019-06-14 16:15:32,583 INFO [pool-5-thread-1] com.tencent.angel.worker.storage.DiskDataBlock: KVDiskStorage create a new file, filePath = /data7/hadoop/local/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/hdfsdatacache/a2da6475-9cf5-4477-ba14-1f6dbe44e0fa_WorkerAttempt_1_1_0_1_0 2019-06-14 16:16:06,221 INFO [pool-5-thread-1] com.tencent.angel.worker.storage.DiskDataBlock: KVDiskStorage create a new file, filePath = /data8/hadoop/local/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/hdfsdatacache/a2da6475-9cf5-4477-ba14-1f6dbe44e0fa_WorkerAttempt_1_1_0_1_1 2019-06-14 16:16:39,867 INFO [pool-5-thread-1] com.tencent.angel.worker.storage.DiskDataBlock: KVDiskStorage create a new file, filePath = /data9/hadoop/local/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/hdfsdatacache/a2da6475-9cf5-4477-ba14-1f6dbe44e0fa_WorkerAttempt_1_1_0_1_2 2019-06-14 16:17:14,161 INFO [pool-5-thread-1] com.tencent.angel.worker.storage.DiskDataBlock: KVDiskStorage create a new file, filePath = /data1/hadoop/local/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/hdfsdatacache/a2da6475-9cf5-4477-ba14-1f6dbe44e0fa_WorkerAttempt_1_1_0_1_3 2019-06-14 16:17:49,774 INFO [pool-5-thread-1] com.tencent.angel.worker.storage.DiskDataBlock: KVDiskStorage create a new file, filePath = 
/data0/hadoop/local/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/hdfsdatacache/a2da6475-9cf5-4477-ba14-1f6dbe44e0fa_WorkerAttempt_1_1_0_1_4 2019-06-14 16:18:22,857 INFO [pool-5-thread-1] com.tencent.angel.worker.storage.DiskDataBlock: KVDiskStorage create a new file, filePath = /data10/hadoop/local/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/hdfsdatacache/a2da6475-9cf5-4477-ba14-1f6dbe44e0fa_WorkerAttempt_1_1_0_1_5 2019-06-14 16:18:59,737 INFO [pool-5-thread-1] com.tencent.angel.worker.storage.DiskDataBlock: KVDiskStorage create a new file, filePath = /data11/hadoop/local/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/hdfsdatacache/a2da6475-9cf5-4477-ba14-1f6dbe44e0fa_WorkerAttempt_1_1_0_1_6 2019-06-14 16:19:39,147 INFO [pool-5-thread-1] com.tencent.angel.worker.storage.DiskDataBlock: KVDiskStorage create a new file, filePath = /data2/hadoop/local/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/hdfsdatacache/a2da6475-9cf5-4477-ba14-1f6dbe44e0fa_WorkerAttempt_1_1_0_1_7 2019-06-14 16:20:15,750 INFO [pool-5-thread-1] com.tencent.angel.worker.storage.DiskDataBlock: KVDiskStorage create a new file, filePath = /data3/hadoop/local/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/usercache/weibo_bigdata_push/appcache/application_1559141668369_2811431/hdfsdatacache/a2da6475-9cf5-4477-ba14-1f6dbe44e0fa_WorkerAttempt_1_1_0_1_8
If a single worker creates this many files, does that mean I can increase the number of workers?
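This mainly depends on the split size used when reading from HDFS. When the dataset is small, using too many workers leaves some workers without any data. The worker count is usually chosen based on the data size; I think the HDFS default is 64MB.
So that is the number of sub-files HDFS divides the data into. My worker count should not exceed that number, and ideally should be close to it, right? OK, thanks!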
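The rule of thumb from this exchange can be sketched as a small calculation: estimate how many input splits the training data produces and keep the worker count at or below that number. The function names and the 300 MB example are illustrative only. The 64 MB split size is the figure mentioned in the thread and is an assumption here; the real value depends on the cluster configuration (recent Hadoop releases default `dfs.blocksize` to 128 MB).

```python
import math

MB = 1024 * 1024

def estimate_split_count(total_bytes, split_bytes=64 * MB):
    """Approximate number of input splits for a dataset of the given size."""
    return math.ceil(total_bytes / split_bytes)

def cap_worker_number(requested_workers, total_bytes, split_bytes=64 * MB):
    """Limit the worker count so that no worker is left without a split."""
    splits = estimate_split_count(total_bytes, split_bytes)
    return max(1, min(requested_workers, splits))

# Hypothetical example: a 300 MB dataset submitted with 80 workers.
print(estimate_split_count(300 * MB))   # number of splits
print(cap_worker_number(80, 300 * MB))  # workers that can actually receive data
```

The result could then be passed as `--angel.workergroup.number` in the submit command. This is only an upper bound; it says nothing about how many workers give the best training speed.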
Hi, I've run into the following problem: with the tree depth set to 3 the model trains normally, but with the depth set to 4 training never finishes. I found no exception in the logs, and printing everything shows that the earlier trees split nodes and new trees are created normally. However, the worker log contains GC (Allocation Failure):
2019-06-18T16:15:33.447+0800: 68.657: [CMS-concurrent-sweep-start] 2019-06-18T16:15:34.173+0800: 69.383: [CMS-concurrent-sweep: 0.726/0.726 secs] [Times: user=0.52 sys=0.22, real=0.73 secs] 2019-06-18T16:15:34.177+0800: 69.387: [CMS-concurrent-reset-start] 2019-06-18T16:15:34.189+0800: 69.399: [CMS-concurrent-reset: 0.012/0.012 secs] [Times: user=0.01 sys=0.00, real=0.02 secs] 2019-06-18T16:15:47.769+0800: 82.978: [GC (Allocation Failure) 2019-06-18T16:15:47.769+0800: 82.979: [ParNew: 1827928K->217069K(1834688K), 0.1055441 secs] 3778549K->2167690K(5085724K), 0.1057634 secs] [Times: user=1.84 sys=0.26, real=0.10 secs] 2019-06-18T16:15:56.448+0800: 91.658: [GC (Allocation Failure) 2019-06-18T16:15:56.448+0800: 91.658: [ParNew: 1684845K->224900K(1834688K), 0.1049570 secs] 3635466K->2175520K(5085724K), 0.1051838 secs] [Times: user=2.16 sys=0.12, real=0.10 secs] 2019-06-18T16:16:00.718+0800: 95.928: [GC (Allocation Failure) 2019-06-18T16:16:00.718+0800: 95.928: [ParNew: 1692676K->218018K(1834688K), 0.0769851 secs] 3643296K->2168638K(5085724K), 0.0772154 secs] [Times: user=1.54 sys=0.09, real=0.07 secs] 2019-06-18T16:16:02.983+0800: 98.193: [GC (Allocation Failure) 2019-06-18T16:16:02.983+0800: 98.193: [ParNew: 1685794K->226923K(1834688K), 0.0970035 secs] 3636414K->2177544K(5085724K), 0.0972066 secs] [Times: user=1.70 sys=0.05, real=0.10 secs] 2019-06-18T16:16:08.082+0800: 103.292: [GC (Allocation Failure) 2019-06-18T16:16:08.082+0800: 103.292: [ParNew: 1694699K->219451K(1834688K), 0.0807246 secs] 3645320K->2170071K(5085724K), 0.0810081 secs] [Times: user=1.55 sys=0.01, real=0.08 secs] 2019-06-18T16:16:13.220+0800: 108.430: [GC (Allocation Failure) 2019-06-18T16:16:13.220+0800: 108.430: [ParNew: 1687227K->166313K(1834688K), 0.0652073 secs] 3637847K->2116933K(5085724K), 0.0655008 secs] [Times: user=1.15 sys=0.10, real=0.06 secs]
Heap par new generation total 1834688K, used 175193K [0x0000000670000000, 0x00000006f6600000, 0x00000006f6600000) eden space 1467776K, 0% used [0x0000000670000000, 0x00000006708ac1f0, 0x00000006c9960000) from space 366912K, 45% used [0x00000006dffb0000, 0x00000006ea21a538, 0x00000006f6600000) to space 366912K, 0% used [0x00000006c9960000, 0x00000006c9960000, 0x00000006dffb0000) concurrent mark-sweep generation total 3251036K, used 1950620K [0x00000006f6600000, 0x00000007bccd7000, 0x00000007c0000000) Metaspace used 39237K, capacity 39488K, committed 39748K, reserved 1085440K class space used 4599K, capacity 4659K, committed 4676K, reserved 1048576K End of LogType:gc.log
LogType:stderr Log Upload Time:Tue Jun 18 16:17:09 +0800 2019 LogLength:517 Log Contents: Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=100M; support was removed in 8.0 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=200M; support was removed in 8.0 Java HotSpot(TM) 64-Bit Server VM warning: UseCMSCompactAtFullCollection is deprecated and will likely be removed in a future release. Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory. (error = 22) Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory. (error = 1) End of LogType:stderr
LogType:stdout Log Upload Time:Tue Jun 18 16:17:09 +0800 2019 LogLength:0 Log Contents: End of LogType:stdout
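Can this be solved by increasing memory? Which memory should I increase: the worker's or the PS's? @bluesjjw.
Did the worker fail outright, or is it stuck and not making progress? If it failed outright, it is probably running out of memory. How much memory is it given now? A deeper tree needs more memory, so increase the worker memory.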
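When judging whether the worker is short of memory, it may help to read the GC log numerically. In HotSpot GC logs, `GC (Allocation Failure)` on a `ParNew` line is the usual trigger for a young-generation collection and does not by itself mean the JVM ran out of memory; the heap occupancy after each collection is the more telling figure. The sketch below extracts that figure from lines in the format quoted earlier in this thread. The helper name `heap_after_gc` is made up for illustration, and the regular expression only covers this ParNew/CMS log layout.

```python
import re

# [ParNew: young_before->young_after(young_cap), t secs] heap_before->heap_after(heap_cap)
PARNEW = re.compile(
    r"\[ParNew: (\d+)K->(\d+)K\((\d+)K\), [\d.]+ secs\] "
    r"(\d+)K->(\d+)K\((\d+)K\)"
)

def heap_after_gc(gc_log_text):
    """Yield (heap_used_after_kb, heap_capacity_kb) for each ParNew collection."""
    for match in PARNEW.finditer(gc_log_text):
        yield int(match.group(5)), int(match.group(6))

# Sample entry copied from the worker gc.log above.
sample = (
    "2019-06-18T16:15:47.769+0800: 82.978: [GC (Allocation Failure) "
    "2019-06-18T16:15:47.769+0800: 82.979: [ParNew: 1827928K->217069K(1834688K), "
    "0.1055441 secs] 3778549K->2167690K(5085724K), 0.1057634 secs]"
)
for used_kb, capacity_kb in heap_after_gc(sample):
    print(f"heap after GC: {used_kb / capacity_kb:.0%} of {capacity_kb} KB")
```

If the post-collection occupancy stays well below the heap capacity across many collections, memory pressure is a less likely explanation for a stalled job; if it climbs towards the capacity, raising `--angel.worker.memory.gb` is the natural next step.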
After raising both the PS memory and the worker memory, it still does not finish. A depth-3 GBDT model normally finishes in a few minutes, but the depth-4 model has already been running for a day, and the worker log shows:
2019-06-19 10:59:45,821 INFO [pool-9-thread-15] com.tencent.angel.ml.GBDT.algo.AfterSplitThread: Active node[4]: split feature[45] value[NaN], lossChg[73969.500000], sumGrad[4811197.000000], sumHess[4772861.500000] 2019-06-19 10:59:45,821 INFO [pool-9-thread-4] com.tencent.angel.ml.GBDT.algo.AfterSplitThread: Active node[5]: split feature[3] value[0.015144], lossChg[218572.156250], sumGrad[1531506.000000], sumHess[8512740.000000] 2019-06-19 18:55:47,606 WARN [Worker Heartbeat] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: Error reading the stream java.io.IOException: No such process 2019-06-19 19:59:51,224 WARN [Worker Heartbeat] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: Error reading the stream java.io.IOException: No such process 2019-06-20 03:07:06,671 WARN [Worker Heartbeat] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: Error reading the stream java.io.IOException: No such process
workerNumber=80 workerMemory=15 workerCpu=4 taskNumber=5 taskMemory=10 storageLevel=memory_disk PSNumber=4 PSMemory=15
This is a 50 GB dataset with 140 million samples and 47 features. I think the memory is already set quite high, isn't it? I doubled it.
@bluesjjw.
--ml.gbdt.tree.num 4 \ --ml.gbdt.tree.depth 4 \ --ml.gbdt.split.num 10 \
Error on the PS side:
2019-06-20 13:57:01,138 INFO [ForkJoinPool-1-worker-17] com.tencent.angel.ps.storage.matrix.ServerMatrix: Rename from hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node6/_tmp.psmeta to hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node6/psmeta 2019-06-20 13:57:01,157 WARN [DataStreamer for file /user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node7/_tmp.psmeta] org.apache.hadoop.hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1252) at java.lang.Thread.join(Thread.java:1326) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546) 2019-06-20 13:57:01,158 INFO [ForkJoinPool-1-worker-25] com.tencent.angel.ps.storage.matrix.ServerMatrix: Rename from hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node7/_tmp.psmeta to hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node7/psmeta 2019-06-20 13:57:01,161 INFO [ForkJoinPool-1-worker-1] com.tencent.angel.ps.storage.matrix.ServerMatrix: Rename from hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node9/_tmp.psmeta to 
hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node9/psmeta 2019-06-20 13:57:01,175 INFO [ForkJoinPool-1-worker-0] com.tencent.angel.ps.storage.matrix.ServerMatrix: Rename from hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node5/_tmp.psmeta to hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node5/psmeta 2019-06-20 13:57:01,175 INFO [ForkJoinPool-1-worker-28] com.tencent.angel.ps.storage.matrix.ServerMatrix: Rename from hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node4/_tmp.psmeta to hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.grad.histogram.node4/psmeta 2019-06-20 13:57:01,229 INFO [ForkJoinPool-1-worker-24] com.tencent.angel.ps.storage.matrix.ServerMatrix: Rename from hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.active.nodes/_tmp.psmeta to hdfs:/user/weibo_bigdata_push/angel_stage/application_1560572543914_679255_69396bda-cc1e-4b80-ab2c-12af5b6acfbe/snapshot/ParameterServer_3/_tmp.0/gbdt.active.nodes/psmeta
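There are no error messages on the worker side. Is the job stuck?
Yes, but I have solved the problem: I reduced the number of workers from 80 to 50, and the depth-4 run finished. If that is the fix, was it because there were too many workers before, leaving each worker with too few training samples? (With 80 workers, each worker got a bit under 2 million rows, and a bit under 200,000 rows for validation.)
That is possible. The number of workers should be chosen based on the total size of the dataset. We discussed this before, right?
Yes, but last time the error occurred because the worker count was set too high, so some workers were assigned no data. This time I checked specifically, and that did not happen.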
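To make the worker-count discussion above concrete, here is a minimal sketch of how rows per worker change with the number of workers. It assumes data is split evenly across workers and that `--ml.data.validate.ratio` is applied per worker; both are simplifying assumptions, and the total row count of 150 million is a hypothetical figure chosen only to be consistent with "a bit under 2 million rows per worker at 80 workers".

```python
def per_worker_counts(total_rows, workers, validate_ratio):
    """Estimate (train_rows, validate_rows) per worker.

    Assumes an even split across workers, which real input splits
    only approximate.
    """
    rows = total_rows // workers
    validate = round(rows * validate_ratio)
    return rows - validate, validate


# Hypothetical total; not taken from the thread.
TOTAL_ROWS = 150_000_000

for workers in (80, 50):
    train, validate = per_worker_counts(TOTAL_ROWS, workers, 0.1)
    print(f"{workers} workers: ~{train} train rows, ~{validate} validate rows each")
```

Under these assumptions, going from 80 to 50 workers raises each worker's share by a factor of 1.6. Whether that was the actual reason the depth-4 run completed is not established in this thread.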
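Hello, after converting the GBDT model I found -1.0 among the feature indices in gbdt.split.feature. Does -1 refer to my first feature? And do entries where the feature equals 0.0 mean that the node does not split? @bluesjjw
feature = -1 means the node does not split. A node with feature value = 0 still splits; this usually happens with a categorical feature.
Oh, OK.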
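One more question: nodes that do not split are still assigned node IDs, right? Some of my features are -1, but a tree of depth 7 still has 127 nodes in total, exactly 2^7-1.
Also, by your explanation, if feature = -1 means no split, then the leaf nodes numbered 63~126 should all be -1, but in my model they are all 0.
The saved model contains all nodes, 1-127, but a node marked -1 and its descendants were never split. This is done so that saved models have a uniform layout.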
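The node counts quoted above follow from the size of a full binary tree. The sketch below assumes 0-based numbering in level order, which matches the "leaf nodes numbered 63~126" observation for depth 7; the function name is made up for illustration.

```python
def full_tree_layout(depth):
    """Return (total_nodes, leaf_index_range) for a full binary tree.

    Assumes nodes are numbered from 0 in level order, so the deepest
    level occupies the last 2**(depth - 1) indices.
    """
    total = 2 ** depth - 1
    first_leaf = 2 ** (depth - 1) - 1
    return total, range(first_leaf, total)


total, leaves = full_tree_layout(7)
print(total, leaves[0], leaves[-1])
```

For depth 7 this gives 127 nodes with the deepest level at indices 63 through 126, so a saved tree always has 2^depth - 1 slots regardless of how many nodes actually split.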
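So does each node's value simply mean that samples below the value go to the left subtree and samples above it go to the right subtree?
Yes. Less than or equal goes to the left subtree; greater goes to the right.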
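The rules stated in this exchange (feature = -1 means no split; value <= threshold goes left, otherwise right) can be sketched as a traversal over the converted arrays. This is an illustration, not Angel's own prediction code: the child indexing `2*i + 1` / `2*i + 2`, the dense feature list `x`, and the function name are all assumptions.

```python
def route_to_leaf(split_feature, split_value, x):
    """Walk one sample down a tree stored as flat, level-order arrays.

    split_feature[i] == -1 marks a node that was never split.
    Assumed layout: children of node i are 2*i + 1 and 2*i + 2.
    """
    node = 0
    n = len(split_feature)
    while True:
        left = 2 * node + 1
        feat = int(split_feature[node])
        # Stop at an unsplit node, or at the deepest level where no
        # children exist (those entries may hold 0 rather than -1).
        if feat == -1 or left >= n:
            return node
        # "Less than or equal goes left, greater goes right."
        node = left if x[feat] <= split_value[node] else left + 1


# Toy depth-3 tree (7 slots); node 2 is unsplit.
split_feature = [0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
split_value = [0.5, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(route_to_leaf(split_feature, split_value, [0.3, 5.0]))
print(route_to_leaf(split_feature, split_value, [0.9, 0.0]))
```

The `left >= n` check reflects the observation earlier in the thread that bottom-level entries are stored as 0 rather than -1, so the feature index alone cannot identify every leaf.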
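@bluesjjw hello, I have a new problem with GBDT training. When training multiple trees, training gets stuck partway through, but the program does not go down and reports no error. Looking at the logs, I noticed that around the point where it gets stuck there are nodes whose split value is NaN. Could that be the cause, or is it something else? The last part of the log before it hangs is as follows: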
[5], start[4323443], end[4333443]
2019-09-24 12:02:54,646 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Calculate thread: nid[5], start[4333443], end[4343443]
2019-09-24 12:02:54,646 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Calculate thread: nid[5], start[4343443], end[4353443]
2019-09-24 12:02:54,646 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Calculate thread: nid[5], start[4353443], end[4363443]
2019-09-24 12:02:54,646 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Calculate thread: nid[5], start[4363443], end[4373443]
2019-09-24 12:02:54,646 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Calculate thread: nid[5], start[4373443], end[4383443]
2019-09-24 12:02:54,646 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Calculate thread: nid[5], start[4383443], end[4393443]
2019-09-24 12:02:54,646 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Calculate thread: nid[5], start[4393443], end[4403443]
2019-09-24 12:02:54,646 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Calculate thread: nid[5], start[4403443], end[4410420]
2019-09-24 12:02:57,537 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Subtract thread: nid[4]
2019-09-24 12:02:57,537 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Subtract thread: nid[6]
2019-09-24 12:02:57,539 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Run active node cost: 2895 ms
2019-09-24 12:02:57,735 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: clock and flush matrices [gbdt.grad.histogram.node6, gbdt.grad.histogram.node5, gbdt.grad.histogram.node4, gbdt.grad.histogram.node3] cost 196 ms
2019-09-24 12:02:57,745 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: **Current phase: FIND_SPLIT, clock[503]**
2019-09-24 12:02:57,745 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: ------Find split------
2019-09-24 12:02:57,745 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Task[0] responsible tree node: [3]
2019-09-24 12:02:57,745 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 460
2019-09-24 12:02:58,744 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 460 over
2019-09-24 12:02:58,753 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Pull histogram from PS cost 1008 ms
2019-09-24 12:02:58,753 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Best split of node[3]: feature[45], value[NaN], losschg[2212614.000000]
2019-09-24 12:02:58,758 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Find split cost: 1013 ms
2019-09-24 12:02:58,908 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: clock and flush matrices [gbdt.split.value, gbdt.node.grad.stats, gbdt.split.feature, gbdt.split.gain] cost 150 ms
2019-09-24 12:02:58,909 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.GBDTLearner: **Current phase: AFTER_SPLIT, clock[504]**
2019-09-24 12:02:58,909 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: ------After split------
2019-09-24 12:02:58,909 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 461
2019-09-24 12:02:58,949 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 461 over
2019-09-24 12:02:58,963 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 461
2019-09-24 12:02:58,963 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 461 over
2019-09-24 12:02:58,973 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 461
2019-09-24 12:02:58,973 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 461 over
2019-09-24 12:02:58,994 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 461
2019-09-24 12:02:58,994 INFO [pool-5-thread-1] com.tencent.angel.psagent.consistency.ConsistencyController: wait for clock 461 over
2019-09-24 12:02:59,015 INFO [pool-5-thread-1] com.tencent.angel.ml.GBDT.algo.GBDTController: Get split result from PS cost 106 ms
2019-09-24 12:02:59,016 INFO [pool-9-thread-6] com.tencent.angel.ml.GBDT.algo.AfterSplitThread: Active node[3]: split feature[45] value[NaN], lossChg[2212614.000000], sumGrad[29335740.000000], sumHess[27589250.000000]
2019-09-24 12:02:59,030 INFO [pool-9-thread-11] com.tencent.angel.ml.GBDT.algo.AfterSplitThread: Active node[5]: split feature[42] value[-2.000000], lossChg[1992325.625000], sumGrad[-13305986.000000], sumHess[42665720.000000]
2019-09-24 12:02:59,030 INFO [pool-9-thread-16] com.tencent.angel.ml.GBDT.algo.AfterSplitThread: Active node[6]: split feature[42] value[-2.000000], lossChg[1765322.000000], sumGrad[-33581768.000000], sumHess[42314952.000000]
2019-09-24 12:02:59,043 INFO [pool-9-thread-5] com.tencent.angel.ml.GBDT.algo.AfterSplitThread: Active node[4]: split feature[3] value[0.012293], lossChg[2749910.500000], sumGrad[6899572.000000], sumHess[48063624.000000]
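To check how often NaN split values occur across a long worker log, a small scan like the one below may help. It is a sketch that assumes the "Best split of node[...]" line format shown in the excerpt above; the function name is made up, and a NaN hit only narrows the search, it does not by itself show that NaN causes the hang.

```python
import math
import re

# Matches lines such as:
#   Best split of node[3]: feature[45], value[NaN], losschg[2212614.000000]
SPLIT_RE = re.compile(
    r"Best split of node\[(\d+)\]: feature\[(-?\d+)\], value\[([^\]]+)\]"
)


def find_nan_splits(log_text):
    """Return (node, feature) pairs whose logged split value is NaN."""
    hits = []
    for m in SPLIT_RE.finditer(log_text):
        node, feature = int(m.group(1)), int(m.group(2))
        value = float(m.group(3))  # float("NaN") parses to nan
        if math.isnan(value):
            hits.append((node, feature))
    return hits


sample = (
    "GBDTController: Best split of node[3]: feature[45], value[NaN], "
    "losschg[2212614.000000] "
    "GBDTController: Best split of node[4]: feature[3], value[0.012293], "
    "losschg[2749910.500000]"
)
print(find_nan_splits(sample))
```

One reason a NaN threshold deserves attention: under IEEE 754 rules every ordered comparison with NaN is false, so a "less than or equal goes left" test against a NaN threshold would send every sample to the right child. Inspecting the raw values of the reported feature (feature 45 in the excerpt) for NaN or non-numeric entries is a reasonable next step.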
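Training a GBDT model on a small dataset (5,000 rows) produces results normally, but with 200,000 rows the program runs for 16 hours. Some workers report the exception org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: Error reading the stream java.io.IOException: No such process.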
Training on 2,000,000 rows reports the following exception:
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1252)
    at java.lang.Thread.join(Thread.java:1326)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:638)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeInternal(DFSOutputStream.java:606)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:602)
Is my number of PS set incorrectly? My launch script is as follows:
epochNum=3
learnType=c
workerNumber=5
workerMemory=5
workerCpu=2
taskNumber=5
taskMemory=5
storageLevel=memory_disk
PSNumber=1
PSMemory=5
$ANGEL_HOME/bin/angel-submit \
  --action.type train \
  --angel.app.submit.class com.tencent.angel.ml.GBDT.GBDTRunner \
  --angel.train.data.path $input_path \
  --angel.save.model.path $model_path \
  --angel.log.path $log_path \
  --ml.data.type libsvm \
  --ml.feature.index.range $featureNum \
  --ml.feature.num $featureNum \
  --ml.gbdt.tree.num 3 \
  --ml.gbdt.tree.depth 3 \
  --ml.gbdt.split.num 10 \
  --ml.data.validate.ratio 0.1 \
  --ml.gbdt.sample.ratio 1 \
  --ml.epoch.num $epochNum \
  --ml.learn.rate $learnRate \
  --angel.workergroup.number $workerNumber \
  --angel.worker.memory.gb $workerMemory \
  --angel.worker.cpu.vcores $workerCpu \
  --angel.worker.task.number $taskNumber \
  --angel.task.data.storage.level $storageLevel \
  --angel.task.memorystorage.max.gb $taskMemory \
  --angel.ps.number $PSNumber \
  --angel.ps.memory.gb $PSMemory \
  --angel.am.memory.gb 4 \
  --angel.staging.dir /user/weibo_bigdata_push/angel_stage \
  --angel.tmp.output.path.prefix /user/weibo_bigdata_push/angel_stage
Is the number of PS or the memory set too small? Why would it not be enough even for training on 200,000 rows?