Angel-ML / sona

Spark On Angel, arming Spark with a powerful Parameter Server, which enables Spark to train very large models
Apache License 2.0

run demo of sona latest version bug #62

Open lcx517 opened 4 years ago

lcx517 commented 4 years ago

Hi, I'm running the SONA example and the job FAILED with the stdout log below. Please help.

2019-12-26 14:09:19 INFO  SignalUtils:54 - Registered signal handler for TERM
2019-12-26 14:09:19 INFO  SignalUtils:54 - Registered signal handler for HUP
2019-12-26 14:09:19 INFO  SignalUtils:54 - Registered signal handler for INT
2019-12-26 14:09:19 INFO  SecurityManager:54 - Changing view acls to: deepthought
2019-12-26 14:09:19 INFO  SecurityManager:54 - Changing modify acls to: deepthought
2019-12-26 14:09:19 INFO  SecurityManager:54 - Changing view acls groups to: 
2019-12-26 14:09:19 INFO  SecurityManager:54 - Changing modify acls groups to: 
2019-12-26 14:09:19 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(deepthought); groups with view permissions: Set(); users  with modify permissions: Set(deepthought); groups with modify permissions: Set()
2019-12-26 14:09:20 INFO  UserGroupInformation:964 - Login successful for user deepthought using keytab file deepthought.keytab-4169bc48-f895-42c2-9dde-091feb49f3c5
2019-12-26 14:09:20 INFO  ApplicationMaster:54 - Preparing Local resources
2019-12-26 14:09:22 WARN  Client:677 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2019-12-26 14:09:28 INFO  ApplicationMaster:54 - ApplicationAttemptId: appattempt_1576380960005_2467808_000001
2019-12-26 14:09:28 INFO  AMCredentialRenewer:54 - Scheduling login from keytab in 64776907 millis.
2019-12-26 14:09:28 INFO  ApplicationMaster:54 - Starting the user application in a separate Thread
2019-12-26 14:09:28 ERROR ApplicationMaster:91 - Uncaught exception: 
java.lang.ClassNotFoundException: org.apache.spark.angel.examples.JsonRunnerExamples
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.deploy.yarn.ApplicationMaster.startUserApplication(ApplicationMaster.scala:715)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:491)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
    at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
2019-12-26 14:09:28 INFO  ApplicationMaster:54 - Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.lang.ClassNotFoundException: org.apache.spark.angel.examples.JsonRunnerExamples)
2019-12-26 14:09:28 INFO  ShutdownHookManager:54 - Shutdown hook called

my SONA-example script:

source ./spark-on-angel-env.sh
export HADOOP_CONF_DIR=/usr/lib/hadoop/etc/hadoop

$SPARK_HOME/bin/spark-submit \
    --master yarn-cluster \
    --driver-java-options "-Djava.library.path=/usr/lib/hadoop/lib/native" \
    --keytab /home/deepthought/deepthought.keytab \
    --principal deepthought \
    --queue longyuan.p0 \
    --conf spark.ps.jars=$SONA_ANGEL_JARS \
    --conf spark.ps.instances=10 \
    --conf spark.ps.cores=2 \
    --conf spark.ps.memory=6g \
    --jars $SONA_SPARK_JARS \
    --name "LR-spark-on-angel" \
    --files /data/angel/sona-0.1.0-bin/jsons/logreg.json \
    --driver-memory 10g \
    --num-executors 10 \
    --executor-cores 2 \
    --executor-memory 4g \
    --class org.apache.spark.angel.examples.JsonRunnerExamples \
    ./../lib/angelml-${SONA_VERSION}.jar \
    data:viewfs://hadoop-bd/user/deepthought/test/angel/sona-0.1.0-bin/data/angel/a9a/a9a_123d_train.libsvm \
    modelPath:viewfs://hadoop-bd/user/deepthought/test/output \
    jsonFile:./lr.json \
    lr:0.1

and my spark-on-angel-env.sh:

export JAVA_HOME=/usr
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/usr/local/spark/spark-2.3.1-bin-hadoop2.6
export SONA_HOME=/data/angel/sona-0.1.0-bin
export SONA_HDFS_HOME=viewfs://hadoop-bd/user/deepthought/test/angel/sona-0.1.0-bin
export SONA_VERSION=0.1.0
export ANGEL_VERSION=3.0.1
export ANGEL_UTILS_VERSION=0.1.1
export ANGEL_MLCORE_VERSION=0.1.2

...<not changed default content below>...

PayneJoe commented 4 years ago

The class has already been renamed, but the doc is outdated!

You need to change "org.apache.spark.angel.examples.JsonRunnerExamples" to "com.tencent.angel.sona.examples.JsonRunnerExamples".
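
Before resubmitting, it can help to confirm which package the class actually lives in by listing the jar's entries. The following is a minimal sketch, assuming the JDK `jar` tool is on the PATH; the helper names `class_to_entry` and `listing_has_class` are illustrative and not part of SONA.

```shell
# Convert a fully qualified class name to the entry path used inside a jar,
# e.g. a.b.C -> a/b/C.class
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Read a jar listing (one entry per line) on stdin and succeed only if it
# contains the entry for the given class name.
listing_has_class() {
    grep -qxF "$(class_to_entry "$1")"
}

# Example use against the jar from the submit script (not run here):
#   jar tf ./../lib/angelml-${SONA_VERSION}.jar \
#       | listing_has_class com.tencent.angel.sona.examples.JsonRunnerExamples \
#       && echo "class found"
```

If the check succeeds for the new name, the only line to change in the submit script is `--class com.tencent.angel.sona.examples.JsonRunnerExamples`.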

Good luck!