Tencent / Firestorm

Firestorm is a Remote Shuffle Service, and provides the capability for Apache Spark and Apache Hadoop MapReduce applications to store shuffle data on remote servers

Errors in settings #103

Closed avs-alatau closed 2 years ago

avs-alatau commented 2 years ago

Hi,

Please help me figure out the RSS settings. My cluster runs Hadoop 3.1.3 and Spark 3.2.

It currently uses the YARN Shuffle Service, and I want to set up an external RSS.

sandbox02 – coordinator1
sandbox03 – coordinator2
sandbox04 – server1 RSS
sandbox05 – service2 RSS

My settings are like this now

Coordinator

rss.rpc.server.port 19999
rss.jetty.http.port 19998
rss.coordinator.server.heartbeat.timeout 30000
rss.coordinator.app.expired 60000
rss.coordinator.shuffle.nodes.max 3
rss.coordinator.exclude.nodes.file.path /opt/rss_fs/conf/exclude_nodes

Server

rss.rpc.server.port 19999
rss.jetty.http.port 19998
rss.storage.basePath /srv/data/01/rssdata,/srv/data/02/rssdata,/srv/data/03/rssdata
rss.storage.type MEMORY_HDFS
rss.coordinator.quorum sandbox02:19999,sandbox03:19999
rss.server.buffer.capacity 5gb
rss.server.buffer.spill.threshold 2gb
rss.server.partition.buffer.size 50mb
rss.server.read.buffer.capacity 5gb
rss.server.flush.thread.alive 50
rss.server.flush.threadPool.size 100
rss.server.hdfs.base.path hdfs://sandbox-test/rss

Spark - spark-defaults.conf

...
spark.master                                          yarn
spark.jars                                            hdfs://sandbox-test/spark3/jars/rss-client-spark3-0.3.0-shaded.jar
spark.yarn.archive                                    hdfs://sandbox-test/spark3/yarn/archive.zip

spark.shuffle.service.enabled                         false
spark.dynamicAllocation.enabled                       false

spark.shuffle.manager                                 org.apache.spark.shuffle.RssShuffleManager
spark.rss.coordinator.quorum                          10.3.118.10:19999,10.3.118.11:19999
spark.rss.storage.type                                MEMORY_HDFS
spark.rss.base.path                                   hdfs://sandbox-test/rss

YARN - yarn-site.xml

...
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.shuffle.RssShuffleManager</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
  <value>/opt/rss_fs/jars/client/spark3/rss-client-spark3-0.3.0-shaded.jar</value>
</property>
...

With these settings I can't even start spark-shell. I have tried different combinations of settings, but so far without success. There are no errors in the Coordinator and Server logs.

Coordinator logs:

[INFO] 2022-04-07 20:22:22,532 main CoordinatorServer main - Start to init coordinator server using config ./conf/coordinator.conf
[INFO] 2022-04-07 20:22:22,542 main RssUtils getPropertiesFromFile - Load config from ./conf/coordinator.conf
[INFO] 2022-04-07 20:22:22,680 main CoordinatorServer registerMetrics - Register metrics
[INFO] 2022-04-07 20:22:22,728 main CoordinatorServer registerMetrics - Add metrics servlet
[INFO] 2022-04-07 20:22:22,748 main CoordinatorServer addServlet - Add metrics servlet
[INFO] 2022-04-07 20:22:23,016 main Server doStart - jetty-9.0.2.v20130417
[INFO] 2022-04-07 20:22:23,058 main ContextHandler doStart - started o.e.j.s.ServletContextHandler@4a003cbe{/,null,AVAILABLE}
[INFO] 2022-04-07 20:22:23,076 main ServerConnector doStart - Started ServerConnector@7ea28149{HTTP/1.1}{0.0.0.0:19998}
[INFO] 2022-04-07 20:22:23,076 main JettyServer start - Jetty http server started, listening on port 19998
[INFO] 2022-04-07 20:22:23,149 main GrpcServer start - Grpc server started, listening on 19999.
[INFO] 2022-04-07 20:22:52,587 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 0 applications
[INFO] 2022-04-07 20:23:22,587 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 0 applications
[INFO] 2022-04-07 20:23:22,598 UpdateExcludeNodes-0 SimpleClusterManager parseExcludeNodesFile - Update exclude nodes and 0 nodes was marked as exclude nodes
[INFO] 2022-04-07 20:23:52,587 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 0 applications

Server logs:

[INFO] 2022-04-07 20:22:37,032 main ShuffleServer main - Start to init shuffle server using config ./conf/server.conf
[INFO] 2022-04-07 20:22:37,042 main RssUtils getPropertiesFromFile - Load config from ./conf/server.conf
[INFO] 2022-04-07 20:22:37,078 main ShuffleServer initialization - Start to initialize server 172.18.0.1-19999
[INFO] 2022-04-07 20:22:37,136 main ShuffleServer registerMetrics - Register metrics
[INFO] 2022-04-07 20:22:37,183 main ShuffleServer registerMetrics - Add metrics servlet
[INFO] 2022-04-07 20:22:37,254 main CoordinatorClientFactory createCoordinatorClient - Start to create coordinator clients from sandbox02:19999,sandbox03:19999
[INFO] 2022-04-07 20:22:37,501 main CoordinatorClientFactory createCoordinatorClient - Add coordinator client Coordinator grpc client ref to sandbox02:19999
[INFO] 2022-04-07 20:22:37,505 main CoordinatorClientFactory createCoordinatorClient - Add coordinator client Coordinator grpc client ref to sandbox03:19999
[INFO] 2022-04-07 20:22:37,511 main CoordinatorClientFactory createCoordinatorClient - Finish create coordinator clients Coordinator grpc client ref to sandbox02:19999, Coordinator grpc client ref to sandbox03:19999
[INFO] 2022-04-07 20:22:37,613 main RegisterHeartBeat startHeartBeat - Start heartbeat to coordinator sandbox02:19999,sandbox03:19999 after 10000ms and interval is 10000ms
[INFO] 2022-04-07 20:22:37,617 main Server doStart - jetty-9.0.2.v20130417
[INFO] 2022-04-07 20:22:37,656 main ContextHandler doStart - started o.e.j.s.ServletContextHandler@205d38da{/,null,AVAILABLE}
[INFO] 2022-04-07 20:22:37,675 main ServerConnector doStart - Started ServerConnector@73eedaf0{HTTP/1.1}{0.0.0.0:19998}
[INFO] 2022-04-07 20:22:37,676 main JettyServer start - Jetty http server started, listening on port 19998
[INFO] 2022-04-07 20:22:37,752 main GrpcServer start - Grpc server started, listening on 19999.
[INFO] 2022-04-07 20:22:37,753 main ShuffleServer start - Shuffle server start successfully!

Can you point out my mistakes in the settings?

colinmjj commented 2 years ago

  1. yarn-site.xml shouldn't be changed for RSS
  2. rss.server.buffer.capacity contains rss.server.read.buffer.capacity

avs-alatau commented 2 years ago

Launches fail with an error. Do you have any ideas how to solve it?

2022-04-08 21:20:01,367 INFO client.TransportClientFactory: Successfully created connection to sandbox04/10.3.118.12:41573 after 6 ms (0 ms spent in bootstraps)
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:419)
        at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
        at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.RssShuffleManager


I used rss-client-spark3-0.4.0-shaded.jar.

avs-alatau commented 2 years ago

I managed to run it with Spark 3.2.1. With version 3.2.0 I had errors all the time.

jerqi commented 2 years ago

You can try removing spark.yarn.archive and spark.jars, and adding the RSS jar to the jars directory of the Spark client.
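
As a rough illustration of the two suggestions in this thread, the spark-defaults.conf posted above might end up looking something like the sketch below. It reuses only the values from the original post and is not verified against this cluster. It assumes that "the jars directory of the Spark client" means $SPARK_HOME/jars on the machine that submits the job, and that yarn-site.xml is left without the spark_shuffle aux-service entries.

```
# spark-defaults.conf (sketch only, values copied from the original post)
spark.master                                          yarn

# spark.jars and spark.yarn.archive are removed here, as suggested.
# Assumption: the shaded RSS client jar has instead been copied into
# $SPARK_HOME/jars on the client, so that it is shipped with the other Spark jars.

spark.shuffle.service.enabled                         false
spark.dynamicAllocation.enabled                       false

spark.shuffle.manager                                 org.apache.spark.shuffle.RssShuffleManager
spark.rss.coordinator.quorum                          10.3.118.10:19999,10.3.118.11:19999
spark.rss.storage.type                                MEMORY_HDFS
spark.rss.base.path                                   hdfs://sandbox-test/rss
```

If the ClassNotFoundException for org.apache.spark.shuffle.RssShuffleManager still appears on the executors, that would suggest the jar is not reaching the YARN containers, which is the part of this setup the change is meant to address.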