Angel-ML / PyTorch-On-Angel

PyTorch On Angel, arming PyTorch with a powerful Parameter Server, which enables PyTorch to train very big models.

Error: NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize #24

Open fy88fy opened 3 years ago

fy88fy commented 3 years ago

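Hello, I have a question. I am testing PyTorch-On-Angel and the submission fails with the error below. The only class I can find in the code is com.tencent.angel.graph.utils.params.HasBatchSize. What could be causing this?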

param mode = yarn-client
Exception in thread "main" java.lang.NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at com.tencent.angel.pytorch.examples.supervised.RecommendationExample$.main(RecommendationExample.scala:53)
    at com.tencent.angel.pytorch.examples.supervised.RecommendationExample.main(RecommendationExample.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.tencent.angel.spark.ml.graph.params.HasBatchSize
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 25 more

ouyangwen-it commented 3 years ago

Which branch are you using, and which Spark version? Try the latest 0.2.1 branch.

dongxuej commented 3 years ago

PyTorch-On-Angel is built from master and Spark is 2.4.5U2. We are running in a CPU-only environment.

ouyangwen-it commented 3 years ago

Use the 0.2.1 branch, with Angel 3.1.0 as the Angel environment.
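A NoClassDefFoundError like the one above usually means the class is not in any jar on the classpath, which is consistent with a branch or version mismatch. The sketch below is one way to check which jars actually contain a class. The function names are hypothetical helpers, not part of PyTorch-On-Angel, and the sketch assumes `unzip` is installed (jars are ordinary zip archives).

```shell
#!/bin/sh
# Hypothetical helper: convert a fully qualified class name into the
# entry path it would have inside a jar.
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Hypothetical helper: print every jar (remaining arguments) whose
# listing contains the class given as the first argument.
find_class_in_jars() {
    entry=$(class_to_entry "$1")
    shift
    for jar in "$@"; do
        # -F matches the entry path literally instead of as a regex.
        if unzip -l "$jar" 2>/dev/null | grep -qF -- "$entry"; then
            echo "$jar"
        fi
    done
}

# Example (assumes SONA_SPARK_JARS is the comma-separated list passed to --jars):
# find_class_in_jars com.tencent.angel.spark.ml.graph.params.HasBatchSize \
#     $(echo "$SONA_SPARK_JARS" | tr ',' ' ')
```

If no jar is printed for the class named in the stack trace but one is printed for com.tencent.angel.graph.utils.params.HasBatchSize, the PyTorch-On-Angel jar and the Angel jars were probably built from incompatible versions.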

dongxuej commented 3 years ago

Angel is currently also built from master. That is 3.1.0, right? Or do I have to download branch 3.1.0?

ouyangwen-it commented 3 years ago

master is fine.

dongxuej commented 3 years ago

The ClassNotFound problem is solved. Now I have run into a new problem:

Exception in thread "main" java.lang.UnsatisfiedLinkError: no torch_angel in java.library.path

The parameters I used are as follows:

input=hdfs://jinrong-hadoop3-1v/home/hdp-jinrong-stargraph/fanqizha/subgraph/input/20191231/
output=hdfs://jinrong-hadoop3-1v/home/hdp-jinrong-stargraph/jiadongxue/angel/model/20191231_deepfm/
source ./spark-on-angel-env.sh
echo "------------------"
#JAVA_LIBRARY_PATH=/home/work/software/java/lib
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
       --conf spark.ps.instances=5 \
       --conf spark.ps.cores=1 \
       --conf spark.ps.jars=$SONA_ANGEL_JARS \
       --conf spark.ps.memory=5g \
       --conf spark.ps.log.level=INFO \
       --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
       --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
       --conf spark.executor.extraLibraryPath=./torch/lib \
       --conf spark.driver.extraLibraryPath=./torch/lib \
       --conf spark.executorEnv.OMP_NUM_THREADS=2 \
       --conf spark.executorEnv.MKL_NUM_THREADS=2 \
       --conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
       --jars $SONA_SPARK_JARS \
       --name "deepfm for torch on angel" \
       --archives /home/work/software/angel/bin/torchlib.zip#torch \
       --files /home/work/software/angel/bin/deepfm.pt \
       --driver-memory 5g \
       --num-executors 5 \
       --executor-cores 1 \
       --executor-memory 5g \
       --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
       ./pytorch-on-angel-0.2.1.jar \
       trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
       stepSize:0.001 numEpoch:10 testRatio:0.1 \
       angelModelOutputPath:$output mode:yarn-client \

The error is as follows: image

ouyangwen-it commented 3 years ago

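The directory layout of your unpacked dependency archive probably does not match what you configured. Spark's --archives option takes the archive from HDFS and unpacks it into the executor's working directory, under a directory named after the alias that follows the # sign. So the directory should be ./torch/(the directory structure inside your archive). Note that --archives expects an HDFS path.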
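To make the path matching concrete: on Linux, `System.loadLibrary("torch_angel")` looks for a file named `libtorch_angel.so` in each directory listed in java.library.path. The sketch below walks a colon-separated path the same way. `find_native_lib` is a hypothetical helper written for this thread, not an Angel or Spark tool.

```shell
#!/bin/sh
# Hypothetical helper: search a colon-separated library path for the
# shared object that System.loadLibrary("<name>") needs on Linux,
# i.e. lib<name>.so. Prints the first match and returns 0, else returns 1.
find_native_lib() {
    name=$1
    path_list=$2
    old_ifs=$IFS
    IFS=:
    for dir in $path_list; do
        if [ -f "$dir/lib$name.so" ]; then
            IFS=$old_ifs
            echo "$dir/lib$name.so"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Example, run from the working directory of the JVM that reports the error:
# find_native_lib torch_angel "$JAVA_LIBRARY_PATH:.:./torch/lib" \
#     || echo "libtorch_angel.so not found on this path"
```

Relative entries such as ./torch/lib are resolved against the working directory of the JVM that loads the library. In yarn-client mode the driver runs on the submitting machine rather than in a YARN container, so it may not see the directory that --archives unpacks. That is a possible explanation for the error appearing in thread "main", not a confirmed diagnosis.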

fy88fy commented 3 years ago

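Hello, I have run into a problem: the job cannot be submitted to YARN because the deepfm.pt file cannot be found on HDFS. Could you please take a look? The script is configured as follows: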

#!/bin/bash
input=hdfs://xxxx-1v/home/yarn/pytorch-on-angel/census_148d_train.libsvm.tmp
output=hdfs://xxxx-1v/home/hdp/jia/angel/model/20191231_louvain/
source ./spark-on-angel-env.sh
echo "------------------"
#JAVA_LIBRARY_PATH=/home/work/software/java/lib
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
       --conf spark.ps.instances=5 \
       --conf spark.ps.cores=1 \
       --conf spark.ps.jars=$SONA_ANGEL_JARS \
       --conf spark.ps.memory=5g \
       --conf spark.ps.log.level=INFO \
       --archives hdfs://XXXX-hadoop3-1v/home/yarn/pytorch-on-angel/torchlib.zip#torch \
       --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
       --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
       --conf spark.executor.extraLibraryPath=./torch/lib \
       --conf spark.driver.extraLibraryPath=./torch/lib \
       --conf spark.executorEnv.OMP_NUM_THREADS=2 \
       --conf spark.executorEnv.MKL_NUM_THREADS=2 \
       --conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.hadoop.fs.defaultFS=hdfs://xxxx-hadoop3-1v/ \
       --jars $SONA_SPARK_JARS  \
       --name "deepfm for torch on angel" \
       --files deepfm.pt \
       --driver-memory 5g \
       --num-executors 5 \
       --executor-cores 1 \
       --executor-memory 5g \
       --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
       ./pytorch-on-angel-0.2.1.jar\
       trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
       stepSize:0.001 numEpoch:10 testRatio:0.1 \
       angelModelOutputPath:$output mode:yarn-client \

image

ouyangwen-it commented 3 years ago

Could you try submitting the Spark job in yarn-cluster mode?

fy88fy commented 3 years ago

After switching to yarn-cluster, I get the following error: image

image

ouyangwen-it commented 3 years ago

What does the directory structure look like after your torchlib.zip is unpacked?

fy88fy commented 3 years ago

torchlib.zip unpacks to a lib directory, and lib contains many .a files. image

ouyangwen-it commented 3 years ago

Could you print the current directory inside RecommendationExample and check whether torch/lib is there?
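A related check can be done from a shell on the unpacked archive. The JVM loads shared objects (.so), not static archives (.a), so it is worth confirming that the lib directory contains shared libraries at all. The sketch below only counts the two kinds of files. `summarize_libs` is a hypothetical helper, and whether the archive in this thread actually lacks .so files is not known from the discussion.

```shell
#!/bin/sh
# Hypothetical helper: count shared objects (*.so, *.so.N) and static
# archives (*.a) directly under a directory, including symlinks.
summarize_libs() {
    dir=$1
    shared=$(find "$dir" -maxdepth 1 ! -type d -name '*.so*' | wc -l | tr -d ' ')
    static=$(find "$dir" -maxdepth 1 ! -type d -name '*.a' | wc -l | tr -d ' ')
    echo "shared=$shared static=$static"
}

# Example, after unpacking torchlib.zip into ./torch:
# summarize_libs ./torch/lib
```

A result with `shared=0` would suggest the archive was packaged from the wrong build output.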

fy88fy commented 3 years ago

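Hello, in the end I configured the environment on every node of the cluster, and yarn-client mode now works. However, the following error shows up occasionally. What causes it?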

21/01/13 13:07:37 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:38 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:39 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:40 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:41 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:42 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:43 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:44 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:45 INFO Client: Application report for application_1609301285435_0588 (state: RUNNING)
21/01/13 13:07:45 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 10.x.121.219
     ApplicationMaster RPC port: -1
     queue: root.default
     start time: 1610514454704
     final status: UNDEFINED
     tracking URL: http://xxx.net:8888/proxy/application_1609301285435_0588/
     user: user-2001
numDataPartitions=7500
numDataPartitions=7500
type: BIAS_WEIGHT_EMBEDDING_MATS name:DeepFM mats_dims: 130,10,1,1,10,5,1,1,5,1,1,1
optimizer: AsyncAdam eta=0.001 decay=0.001
from driver start Angel PS! 
AppMaster capability = <memory:2048, vCores:1, gCores:0>
validate_auc=0.8820555586167144 time=12161ms                                     
train_auc=0.8922190567461854 validate_auc=0.8954355298794986 time=3696ms        
train_auc=0.9007353906278361 validate_auc=0.9038945708702969 time=2840ms        
train_auc=0.9039252388499588 validate_auc=0.9064111964938902 time=2716ms        
train_auc=0.9068299665301356 validate_auc=0.9090327256697933 time=2602ms        
train_auc=0.9075874962681636 validate_auc=0.909377106018285 time=2575ms         
train_auc=0.9093007058840281 validate_auc=0.9113024570743141 time=2535ms        
train_auc=0.9108590019247338 validate_auc=0.9131101889959353 time=2603ms        
train_auc=0.9103151686457647 validate_auc=0.9114020624674163 time=2616ms        
train_auc=0.9114118253731684 validate_auc=0.9126884555230131 time=2597ms        
Exception in thread "main" com.tencent.angel.exception.AngelException: Model save falied Detail failed log:
ParameterServer_0:save task com.tencent.angel.model.PSMatricesSaveContext@6fc796e8 failed:java.io.IOException: java.io.IOException: SAVE matrix failed:java.io.IOException: SAVE model partitions failed.null
    at com.tencent.angel.model.output.format.MatrixFormatImpl.save(MatrixFormatImpl.java:135)
    at com.tencent.angel.ps.io.PSModelIOExecutor.saveMatrix(PSModelIOExecutor.java:237)
    at com.tencent.angel.ps.io.PSModelIOExecutor.process(PSModelIOExecutor.java:216)
    at com.tencent.angel.ps.io.PSModelIOExecutor.access$000(PSModelIOExecutor.java:39)
    at com.tencent.angel.ps.io.PSModelIOExecutor$MatrixDiskIOOp.compute(PSModelIOExecutor.java:195)
    at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

    at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:154)
    at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:121)
    at com.tencent.angel.ps.io.save.PSModelSaver.lambda$save$9(PSModelSaver.java:184)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: SAVE matrix failed:java.io.IOException: SAVE model partitions failed.null
    at com.tencent.angel.model.output.format.MatrixFormatImpl.save(MatrixFormatImpl.java:135)
    at com.tencent.angel.ps.io.PSModelIOExecutor.saveMatrix(PSModelIOExecutor.java:237)
    at com.tencent.angel.ps.io.PSModelIOExecutor.process(PSModelIOExecutor.java:216)
    at com.tencent.angel.ps.io.PSModelIOExecutor.access$000(PSModelIOExecutor.java:39)
    at com.tencent.angel.ps.io.PSModelIOExecutor$MatrixDiskIOOp.compute(PSModelIOExecutor.java:195)
    at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

    at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:151)
    ... 3 more

    at com.tencent.angel.client.AngelClient.isSaveCompleted(AngelClient.java:670)
    at com.tencent.angel.client.AngelClient.save(AngelClient.java:381)
    at com.tencent.angel.client.AngelPSClient.save(AngelPSClient.java:146)
    at com.tencent.angel.spark.context.AngelPSContext$.save(AngelPSContext.scala:300)
    at com.tencent.angel.spark.context.AngelPSContext.save(AngelPSContext.scala:258)
    at com.tencent.angel.pytorch.recommendation.RecommendPSModel.savePSModel(RecommendPSModel.scala:234)
    at com.tencent.angel.pytorch.examples.supervised.RecommendationExample$.main(RecommendationExample.scala:88)
    at com.tencent.angel.pytorch.examples.supervised.RecommendationExample.main(RecommendationExample.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

input=hdfs://XXX-1v/home/yarn/pytorch-on-angel/census_148d_train.libsvm.tmp
output=hdfs://XXX-1v/home/user/pytorch-on-angel/${DT}
source /home/work/software/angel/bin/spark-on-angel-env.sh
echo "------------------"
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
       --conf spark.ps.instances=15 \
       --conf spark.ps.cores=1 \
       --conf spark.ps.jars=$SONA_ANGEL_JARS \
       --conf spark.ps.memory=5g \
       --conf spark.ps.log.level=INFO \
       --archives hdfs://XXX-1v/home/yarn/pytorch-on-angel/torchlib.zip#torch \
       --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:/home/work/software/angel/bin/torch/lib \
       --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:/home/work/software/angel/bin/torch/lib \
       --conf spark.executor.extraLibraryPath=/home/work/software/angel/bin/torch/lib \
       --conf spark.driver.extraLibraryPath=/home/work/software/angel/bin/torch/lib \
       --conf spark.executorEnv.OMP_NUM_THREADS=2 \
       --conf spark.executorEnv.MKL_NUM_THREADS=2 \
       --conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.hadoop.fs.defaultFS=hdfs://XXX-1v/ \
       --jars $SONA_SPARK_JARS \
       --name "deepfm for torch" \
       --driver-memory 5g \
       --num-executors 15 \
       --executor-cores 5 \
       --executor-memory 8g \
       --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
       /home/work/software/angel/bin/pytorch-on-angel-0.2.1.jar.old.jar \
       trainInput:$input batchSize:128 torchModelPath:/home/work/software/angel/bin/deepfm.pt \
       stepSize:0.001 numEpoch:10 testRatio:0.1 \
       angelModelOutputPath:$output mode:yarn-client \

ouyangwen-it commented 3 years ago

The log shows that the error occurred while saving the model. Please check the logs on the PS side.

fy88fy commented 3 years ago

The errors in the PS log are as follows: image image image image

ouyangwen-it commented 3 years ago

Look at the log of the specific PS that failed, ParameterServer_0. For how to view it, see this document: https://github.com/Angel-ML/angel/wiki/%E5%B7%A5%E7%A8%8B%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98
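As a generic alternative to the steps in the linked wiki (which may describe a different procedure), the aggregated container logs can be pulled with the standard `yarn logs` command and filtered for exception lines. This is a minimal sketch: it assumes YARN log aggregation is enabled on the cluster, and `show_exceptions` and the file name app.log are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print exception lines, prefixed with their line
# numbers, from a saved log file.
show_exceptions() {
    grep -n -E 'Exception|Caused by:' "$1"
}

# Example, using the application id from the client log above:
# yarn logs -applicationId application_1609301285435_0588 > app.log
# show_exceptions app.log
```

The line numbers make it easier to jump to the surrounding context in the full log, where the container that hosted ParameterServer_0 should show the underlying cause of the failed save.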