Angel-ML / angel

A Flexible and Powerful Parameter Server for large-scale machine learning

Why can't I submit LDA to YARN? #753

Open wqh17101 opened 5 years ago

wqh17101 commented 5 years ago

I can run ./angel-example com.tencent.angel.example.ml.LDALocalExample without any problem. But when I run the command below [image], I get this [image] and the job never starts.

wqh17101 commented 5 years ago

Also, when I submit LR with angel-submit, I get the following:

expr: syntax error
/data1/etl_sys/common/DayGen.ini: line 12: [: -eq: unary operator expected
/data1/etl_sys/common/DayGen.ini: line 24: [: -eq: unary operator expected
/usr/bin/which: no angel-submit in (.)
dirname: missing operand
Try 'dirname --help' for more information.
/usr/local/jdk1.8.0_162/bin/java -Xmx1000m -Dhadoop.log.dir=/usr/local/share/hadoop-2.6.0-cdh5.14.4/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/local/share/hadoop-2.6.0-cdh5.14.4 -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/cloudera/parcels/GPLEXTRAS-5.14.4-1.cdh5.14.4.p0.3/lib/hadoop/lib/native:: -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true com.tencent.angel.utils.AngelRunJar --angel.app.submit.class com.tencent.angel.ml.core.graphsubmit.GraphRunner --angel.train.data.path hdfs://jr-hdfs/tmp/wangqinghua/lda/angel_test/data/abalone_8d_train.libsvm --angel.log.path hdfs://jr-hdfs/tmp/wangqinghua/lda/angel_test/log --angel.save.model.path hdfs://jr-hdfs/tmp/wangqinghua/lda/angel_test/model --action.type train --ml.model.class.name com.tencent.angel.ml.classification.LogisticRegression --ml.epoch.num 10 --ml.data.type libsvm --ml.feature.index.range 1024 --angel.job.name LR_test --angel.am.memory.gb 2 --angel.worker.memory.gb 2 --angel.ps.memory.gb 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/share/hadoop-2.6.0-cdh5.14.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/wangqinghua/angel/lib/slf4j-log4j12-1.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/05/05 20:04:58 INFO utils.AngelRunJar: angelHomePath conf path=/data2/wangqinghua/angel/bin/..//conf/angel-site.xml
19/05/05 20:04:58 INFO utils.AngelRunJar: load system config file success
19/05/05 20:04:58 INFO utils.AngelRunJar: jars loaded: file:///data2/wangqinghua/angel/bin/..//lib/jniloader-1.1.jar,file:///data2/wangqinghua/angel/bin/..//lib/native_system-java-1.1.jar,file:///data2/wangqinghua/angel/bin/..//lib/arpack_combined_all-0.1.jar,file:///data2/wangqinghua/angel/bin/..//lib/all-1.1.2.pom,file:///data2/wangqinghua/angel/bin/..//lib/core-1.1.2.jar,file:///data2/wangqinghua/angel/bin/..//lib/netlib-native_ref-linux-armhf-1.1-natives.jar,file:///data2/wangqinghua/angel/bin/..//lib/netlib-native_ref-linux-i686-1.1-natives.jar,file:///data2/wangqinghua/angel/bin/..//lib/netlib-native_ref-linux-x86_64-1.1-natives.jar,file:///data2/wangqinghua/angel/bin/..//lib/netlib-native_system-linux-armhf-1.1-natives.jar,file:///data2/wangqinghua/angel/bin/..//lib/netlib-native_system-linux-i686-1.1-natives.jar,file:///data2/wangqinghua/angel/bin/..//lib/netlib-native_system-linux-x86_64-1.1-natives.jar,file:///data2/wangqinghua/angel/bin/..//lib/jackson-annotations-2.6.0.jar,file:///data2/wangqinghua/angel/bin/..//lib/jackson-core-2.6.7.jar,file:///data2/wangqinghua/angel/bin/..//lib/jackson-core-asl-1.8.8.jar,file:///data2/wangqinghua/angel/bin/..//lib/jackson-databind-2.6.7.jar,file:///data2/wangqinghua/angel/bin/..//lib/jackson-jaxrs-1.8.3.jar,file:///data2/wangqinghua/angel/bin/..//lib/jackson-mapper-asl-1.8.8.jar,file:///data2/wangqinghua/angel/bin/..//lib/jackson-module-paranamer-2.6.5.jar,file:///data2/wangqinghua/angel/bin/..//lib/jackson-module-scala_2.11-2.6.5.jar,file:///data2/wangqinghua/angel/bin/..//lib/jackson-xc-1.8.3.jar,file:///data2/wangqinghua/angel/bin/..//lib/json4s-ast_2.11-3.4.2.jar,file:///data2/wangqinghua/angel/bin/..//lib/json4s-core_2.11-3.4.2.jar,file:///data2/wangqinghua/angel/bin/..//lib/json4s-jackson_2.11-3.4.2.jar,file:///data2/wangqinghua/angel/bin/..//lib/json4s-scalap_2.11-3.4.2.jar,file:///data2/wangqinghua/angel/bin/..//lib/netty-all-4.1.1.Final.jar,file:///data2/wangqinghua/angel/bin/..//lib/angel-ps-mllib-2.1.0.jar,file:///data2/wangqinghua/angel/bin/..//lib/angel-ps-tools-2.1.0.jar,file:///data2/wangqinghua/angel/bin/..//lib/scala-reflect-2.11.8.jar,file:///data2/wangqinghua/angel/bin/..//lib/memory-0.8.1.jar,file:///data2/wangqinghua/angel/bin/..//lib/sketches-core-0.8.1.jar,file:///data2/wangqinghua/angel/bin/..//lib/commons-pool-1.6.jar,file:///data2/wangqinghua/angel/bin/..//lib/kryo-shaded-4.0.0.jar,file:///data2/wangqinghua/angel/bin/..//lib/kryo-serializers-0.42.jar,file:///data2/wangqinghua/angel/bin/..//lib/scala-library-2.11.8.jar,file:///data2/wangqinghua/angel/bin/..//lib/angel-ps-core-2.1.0.jar,file:///data2/wangqinghua/angel/bin/..//lib/angel-ps-psf-2.1.0.jar,file:///data2/wangqinghua/angel/bin/..//lib/fastutil-7.1.0.jar,file:///data2/wangqinghua/angel/bin/..//lib/sizeof-0.3.0.jar,file:///data2/wangqinghua/angel/bin/..//lib/minlog-1.3.0.jar,file:///data2/wangqinghua/angel/bin/..//lib/breeze_2.11-0.13.jar
19/05/05 20:04:58 INFO utils.AngelRunJar: angel python file: null
19/05/05 20:04:59 INFO utils.UGITools: UGI_PROPERTY_NAME is null
19/05/05 20:04:59 INFO utils.AngelRunJar: submitClass: com.tencent.angel.ml.core.graphsubmit.GraphRunner
19/05/05 20:04:59 INFO conf.SharedConf: train does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: inctrain does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: predict does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: ml.gbdt.max.node.num does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: ml.gbdt.multi.class.strategy does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: ml.gbdt.multi.class.grad.cache does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: train.loss does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: validate.loss does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: log.likelihood does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: train.error does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: validate.error does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel-default.xml does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel-site.xml does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel. does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.am. does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.worker. does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.ps. does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.task. does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.workergroup. does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.train.data.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.validate.data.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.kerberos.keytab does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.kerberos.principal does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.kerberos.keytab.name does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.predict.data.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.job.input.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.predict.out.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.temp.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.client.type does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.save.model.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.log.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.load.model.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.job.jar does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.ml.conf does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.job.libjars does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: queue does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.app.config.file does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.job.cache.archives does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.job.cache.files does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.job.complete.cancel.delegation.tokens does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.job.submit.host does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.job.submit.host.address does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.submit.user.name does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.job.dir does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.app.user.resource.files does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.app.serilize.state.file does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.output.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.tmp.output.path.prefix does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.tmp.output.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: ANGEL does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.jobid does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: job.xml does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.cluster.local.dir does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.workergroup.actual.number does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.worker.env does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.worker.java.opts does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.worker.heartbeat.interval.ms does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.workergroup.failed.tolerate does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.worker.max-attempts does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.task.actual.number does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.ps.backup.matrices does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.ps.max-attempts does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.ps.child.opts does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.ps.partition.class does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.matrixtransfer.max.requestnum does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.psagent. does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.ps.ip.list does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.psagent.java.opts does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.psagent.iplist does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.model.parse.name does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.parse.model.path does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: ml.connection.timeout does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: netty.server.io.threads does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: netty.io.mode does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: netty.client.io.threads does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: ml.rpc.timeout does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.app.type does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.pyangel.python does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.pyangel.pyfile does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.pyangel.pyfile.dependencies does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.plugin.service.enable does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.sharding.num does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.sharding.concurrent.capacity does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.sharding.model.class does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.master.ip does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.master.port does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.model.name does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.model.load.timeout.minute does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.model.load.check.inteval.second does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.model.load.type does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: angel.serving.predict.local.output does not have default value!
19/05/05 20:04:59 INFO conf.SharedConf: [I@1b11171f does not have default value!
19/05/05 20:04:59 INFO utils.UGITools: UGI_PROPERTY_NAME is null
19/05/05 20:04:59 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm436
19/05/05 20:05:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/05/05 20:05:00 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
19/05/05 20:05:00 INFO client.AngelClient: running mode = ANGEL_PS_WORKER
19/05/05 20:05:00 INFO utils.HdfsUtil: tmp output dir is hdfs://jr-hdfs/tmp/stat/application_1550912047489_102232_d44457e9-ab29-4145-9e26-ba90b7a46e93
19/05/05 20:05:00 INFO utils.HdfsUtil: tmp output dir is hdfs://jr-hdfs/tmp/stat/application_1550912047489_102232_37508e29-7a07-4290-9189-f48d98d36f4e
19/05/05 20:05:00 INFO client.AngelClient: angel.tmp.output.path=hdfs://jr-hdfs/tmp/stat/application_1550912047489_102232_d44457e9-ab29-4145-9e26-ba90b7a46e93
19/05/05 20:05:00 INFO client.AngelClient: internal state file is hdfs://jr-hdfs/tmp/stat/application_1550912047489_102232_37508e29-7a07-4290-9189-f48d98d36f4e/state
19/05/05 20:05:00 INFO yarn.AngelYarnClient: default FileSystem: hdfs://jr-hdfs
19/05/05 20:05:00 ERROR yarn.AngelYarnClient: submit application to yarn failed.
org.apache.hadoop.security.AccessControlException: Permission denied: user=stat, access=WRITE, inode="/tmp/hadoop-yarn":yarn:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:260)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:240)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:162)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3887)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3870)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:3852)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6762)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4503)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4473)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4446)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:966)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:326)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:640)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3164)
    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:3129)
    at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1007)
    at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1003)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:1003)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:995)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1970)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:614)
    at com.tencent.angel.client.yarn.AngelYarnClient.copyAndConfigureFiles(AngelYarnClient.java:208)
    at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:153)
    at com.tencent.angel.ml.core.graphsubmit.GraphRunner.train(GraphRunner.scala:55)
    at com.tencent.angel.ml.core.MLRunner$class.submit(MLRunner.scala:88)
    at com.tencent.angel.ml.core.graphsubmit.GraphRunner.submit(GraphRunner.scala:29)
    at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:91)
    at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:77)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at com.tencent.angel.utils.AngelRunJar.submit(AngelRunJar.java:77)
    at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:44)

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=stat, access=WRITE, inode="/tmp/hadoop-yarn":yarn:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:260)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:240)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:162)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3887)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3870)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:3852)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6762)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4503)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4473)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4446)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:966)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:326)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:640)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)

    at org.apache.hadoop.ipc.Client.call(Client.java:1504)
    at org.apache.hadoop.ipc.Client.call(Client.java:1441)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy16.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:573)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy17.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3162)
    ... 20 more
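
A note for readers following the trace: the line that matters is the AccessControlException, which reports user=stat asking for WRITE on /tmp/hadoop-yarn, a directory owned by yarn:supergroup with mode drwxr-xr-x. The Python sketch below only illustrates how that inode field can be read. It is not part of Angel or Hadoop; the helper names parse_inode and can_write are hypothetical. It assumes the usual owner, then group, then other evaluation order, and it ignores the HDFS superuser and ACLs. The groups of user stat do not appear in the log, so they are passed in as an assumption.

```python
import re

# Matches the inode field as printed in the exception message, e.g.
# "/tmp/hadoop-yarn":yarn:supergroup:drwxr-xr-x
INODE_RE = re.compile(
    r'"(?P<path>[^"]+)":(?P<owner>[^:]+):(?P<group>[^:]+):'
    r'(?P<mode>[d-][rwx-]{8}[rwxtT-])'
)


def parse_inode(field):
    """Split the inode field into path, owner, group and mode string."""
    match = INODE_RE.fullmatch(field)
    if match is None:
        raise ValueError("unrecognized inode field: " + field)
    return match.groupdict()


def can_write(user, user_groups, inode):
    """Simplified write check: owner bits, else group bits, else other bits.

    Assumption: no superuser bypass and no ACL entries are modeled.
    """
    perms = inode["mode"][1:]  # drop the leading file-type character
    if user == inode["owner"]:
        return perms[1] == "w"
    if inode["group"] in user_groups:
        return perms[4] == "w"
    return perms[7] == "w"


inode = parse_inode('"/tmp/hadoop-yarn":yarn:supergroup:drwxr-xr-x')
# The groups of user "stat" are unknown here, so both cases are tried.
for assumed_groups in ([], ["supergroup"]):
    print("stat", assumed_groups, can_write("stat", assumed_groups, inode))
print("yarn", can_write("yarn", [], inode))
```

With mode drwxr-xr-x only the owner has the write bit, so under these assumptions the check fails for stat whichever groups it belongs to, which matches the mkdirs call being rejected in the trace.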

19/05/05 20:05:00 INFO client.AngelClient: stop the application
19/05/05 20:05:00 INFO client.AngelClient: master is null, just kill the application
19/05/05 20:05:00 ERROR yarn.AngelYarnClient: kill application failed,
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Trying to kill an absent application application_1550912047489_102232
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.forceKillApplication(ClientRMService.java:618)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.forceKillApplication(ApplicationClientProtocolPBServiceImpl.java:155)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:405)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.forceKillApplication(ApplicationClientProtocolPBClientImpl.java:175)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy8.forceKillApplication(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.killApplication(YarnClientImpl.java:371)
    at com.tencent.angel.client.yarn.AngelYarnClient.kill(AngelYarnClient.java:183)
    at com.tencent.angel.client.AngelClient.stop(AngelClient.java:407)
    at com.tencent.angel.client.AngelClient.stop(AngelClient.java:413)
    at com.tencent.angel.ml.core.graphsubmit.GraphRunner.train(GraphRunner.scala:69)
    at com.tencent.angel.ml.core.MLRunner$class.submit(MLRunner.scala:88)
    at com.tencent.angel.ml.core.graphsubmit.GraphRunner.submit(GraphRunner.scala:29)
    at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:91)
    at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:77)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at com.tencent.angel.utils.AngelRunJar.submit(AngelRunJar.java:77)
    at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:44)

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException): Trying to kill an absent application application_1550912047489_102232
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.forceKillApplication(ClientRMService.java:618)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.forceKillApplication(ApplicationClientProtocolPBServiceImpl.java:155)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:405)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)

    at org.apache.hadoop.ipc.Client.call(Client.java:1504)
    at org.apache.hadoop.ipc.Client.call(Client.java:1441)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy7.forceKillApplication(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.forceKillApplication(ApplicationClientProtocolPBClientImpl.java:172)
    ... 21 more

19/05/05 20:05:00 ERROR utils.AngelRunJar: submit job failed
com.tencent.angel.exception.AngelException: org.apache.hadoop.security.AccessControlException: Permission denied: user=stat, access=WRITE, inode="/tmp/hadoop-yarn":yarn:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:260)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:240)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:162)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3887)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3870)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:3852)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6762)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4503)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4473)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4446)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:966)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:326)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:640)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)

    at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:176)
    at com.tencent.angel.ml.core.graphsubmit.GraphRunner.train(GraphRunner.scala:55)
    at com.tencent.angel.ml.core.MLRunner$class.submit(MLRunner.scala:88)
    at com.tencent.angel.ml.core.graphsubmit.GraphRunner.submit(GraphRunner.scala:29)
    at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:91)
    at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:77)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at com.tencent.angel.utils.AngelRunJar.submit(AngelRunJar.java:77)
    at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:44)

Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=stat, access=WRITE, inode="/tmp/hadoop-yarn":yarn:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:260)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:240)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:162)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3887)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3870)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:3852)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6762)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4503)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4473)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4446)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:966)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:326)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:640)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3164)
    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:3129)
    at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1007)
    at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1003)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:1003)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:995)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1970)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:614)
    at com.tencent.angel.client.yarn.AngelYarnClient.copyAndConfigureFiles(AngelYarnClient.java:208)
    at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:153)
    ... 10 more

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=stat, access=WRITE, inode="/tmp/hadoop-yarn":yarn:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:260)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:240)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:162)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3887)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3870)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:3852)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6762)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4503)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4473)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4446)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:966)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:326)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:640)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)

    at org.apache.hadoop.ipc.Client.call(Client.java:1504)
    at org.apache.hadoop.ipc.Client.call(Client.java:1441)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy16.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:573)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy17.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3162)
    ... 20 more

wqh17101 commented 5 years ago

LR script: [image]

wqh17101 commented 5 years ago

/tmp/hadoop-yarn looks like a path for storing logs. I set the log path in my script, so why is anything being written there?

paynie commented 5 years ago

"/tmp/hadoop-yarn" on ${fs.defaultFS} is the default Angel staging directory. The staging path is used to upload resource files for YARN applications. The error message means that you do not have write permission on this directory. You can set the angel.staging.dir parameter to point the Angel staging directory at a directory for which you do have permissions.
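
As a minimal sketch of that suggestion: the helper names `staging_opt` and `ensure_staging_dir` and the path in `STAGING_DIR` are hypothetical, chosen only for illustration. The only things taken from this thread are the `angel.staging.dir` parameter and the `-Dkey=value` form used with `angel-submit`. It assumes the `hdfs` CLI is on the PATH of the submitting machine.

```shell
#!/bin/sh
# Sketch only: STAGING_DIR below is a hypothetical path, not an Angel default.

# staging_opt DIR: print the option that points Angel's staging directory at DIR.
staging_opt() {
  printf '%s\n' "-Dangel.staging.dir=$1"
}

# ensure_staging_dir DIR: create DIR on HDFS when the hdfs CLI is available.
ensure_staging_dir() {
  if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -mkdir -p "$1"
  else
    echo "hdfs CLI not found; create $1 manually" >&2
  fi
}

STAGING_DIR="/user/$(id -un)/angel-staging"   # hypothetical location
# ensure_staging_dir "$STAGING_DIR"           # uncomment on a cluster node
staging_opt "$STAGING_DIR"
```

The printed option would be appended to the `angel-submit` command line. Any HDFS directory the submitting user can write to should serve the same purpose.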

wqh17101 commented 5 years ago

@paynie OK, the LR problem is solved. What about the LDA problem?

wqh17101 commented 5 years ago

@paynie As far as I can tell, the parameter settings were not taking effect. After I changed how the script was written, it ran correctly:

sh ./angel-submit \
-Daction.type=train \
-Dangel.app.submit.class=com.tencent.angel.ml.lda.LDARunner \
-Dml.model.class.name=com.tencent.angel.ml.lda.LDAModel \
-Dangel.train.data.path="hdfs://jr-hdfs//tmp/wangqinghua/lda/angel_test/data/nips.doc" \
-Dangel.log.path="hdfs://jr-hdfs//tmp/wangqinghua/lda/angel_test/log" \
-Dangel.save.model.path="hdfs://jr-hdfs//tmp/wangqinghua/lda/angel_test/model" \
-Dsave.doc.topic=true \
-Dsave.word.topic=true \
-Dml.epoch.num=10 \
-Dml.data.type=dummy \
-Dml.feature.index.range=1024 \
-Dangel.job.name=LDAtest \
-Dangel.am.memory.gb=2 \
-Dangel.worker.memory.gb=2 \
-Dangel.ps.memory.gb=2 \
-Dangel.staging.dir="hdfs://jr-hdfs//tmp/wangqinghua/lda/angel_test/stage" \
--queue datamin.default \
-Dangel.output.path.deleteonexist=true

I suggest telling users to use the -D syntax.
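
For anyone with an existing script written in the `--key value` style, the rewrite can be sketched as a small helper. `to_d_opts` is a hypothetical name, not part of Angel. It should only be fed property-style options: in the working command above, `--queue` was left in its original form.

```shell
#!/bin/sh
# to_d_opts: rewrite "--key value" pairs as "-Dkey=value", one per line.
# Hypothetical helper for illustration; not shipped with Angel.
to_d_opts() {
  while [ "$#" -ge 2 ]; do
    key="${1#--}"                      # drop the leading "--"
    printf '%s=%s\n' "-D$key" "$2"
    shift 2
  done
}

to_d_opts --action.type train --ml.epoch.num 10
# prints:
#   -Daction.type=train
#   -Dml.epoch.num=10
```

Values containing spaces or newlines would need quoting when the output is pasted back into a command line; the helper does not handle that.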

Once it was running, I hit this:

19/05/06 11:53:13 ERROR utils.AngelRunJar: submit job failed 
com.tencent.angel.exception.AngelException: com.tencent.angel.exception.AngelException: matrix vocabulary parameter is invalid, 
nonzero index range start can only be use sparse model type now, but model type now is T_INT_DENSE with index range start value = -2147483648
        at com.tencent.angel.client.AngelClient.createMatrices(AngelClient.java:760)
        at com.tencent.angel.client.AngelClient.loadModel(AngelClient.java:228)
        at com.tencent.angel.ml.lda.LDARunner.train(LDARunner.scala:108)
        at com.tencent.angel.ml.core.MLRunner$class.submit(MLRunner.scala:88)
        at com.tencent.angel.ml.lda.LDARunner.submit(LDARunner.scala:29)
        at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:91)
        at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:77)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
        at com.tencent.angel.utils.AngelRunJar.submit(AngelRunJar.java:77)
        at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:44)
Caused by: com.tencent.angel.exception.AngelException: matrix vocabulary parameter is invalid, nonzero index range start can only be use sparse model type now, 
but model type now is T_INT_DENSE with index range start value = -2147483648
        at com.tencent.angel.ml.matrix.MatrixContext.check(MatrixContext.java:613)
        at com.tencent.angel.ml.matrix.MatrixContext.init(MatrixContext.java:554)
        at com.tencent.angel.client.AngelClient.createMatrices(AngelClient.java:752)
        ... 11 more

wqh17101 commented 5 years ago

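Also, the following problem shows up both in the LR run that succeeds on YARN and in the LDA run that fails. I am not sure whether it matters: [image]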

paynie commented 5 years ago

You should set the number of workers and the number of parameter servers with "angel.workergroup.number" and "angel.ps.number".
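
A minimal sketch of passing those two parameters in the `-D` style used earlier in this thread. The helper names `is_pos_int` and `resource_opts` are hypothetical, and the counts in the example call are placeholders rather than tuned recommendations.

```shell
#!/bin/sh
# is_pos_int VALUE: succeed only if VALUE is a decimal integer greater than 0.
is_pos_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) [ "$1" -gt 0 ] ;;
  esac
}

# resource_opts WORKER_GROUPS PS_COUNT: print both options on one line,
# or fail with a message on stderr if either count is not a positive integer.
resource_opts() {
  if ! is_pos_int "$1" || ! is_pos_int "$2"; then
    echo "worker group and PS counts must be positive integers" >&2
    return 1
  fi
  printf '%s %s\n' "-Dangel.workergroup.number=$1" "-Dangel.ps.number=$2"
}

resource_opts 1 1
# prints: -Dangel.workergroup.number=1 -Dangel.ps.number=1
```

The printed options would be appended to the `angel-submit` command alongside the other `-D` settings.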

wqh17101 commented 5 years ago

Don't both of these default to 1? Why do they need to be set? Also, after setting them I got this: [image]. It seems to have come up before in https://github.com/Angel-ML/angel/issues/237

wqh17101 commented 5 years ago

It worked! The documentation needs to be friendlier. @paynie

wqh17101 commented 5 years ago

In LDAModel:

  // Initializing parameters
  //  var V: Int = 0
  //  val path = conf.get(WORD_NUM_PATH)
  //  if (path != null && path.length > 0)
  //    V = HDFSUtils.readFeatureNum(path, conf)
  //  else
  //    V = conf.getInt(WORD_NUM, 1)

I found these commented-out lines. This looks like a useful feature. Was it replaced by something else? @paynie