Mellanox / SparkRDMA

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx
Apache License 2.0

ClassNotFoundException: org.apache.spark.shuffle.rdma.RdmaShuffleManager #36

Open UntaggedRui opened 5 years ago

UntaggedRui commented 5 years ago

The spark-defaults.conf on both the workers and the master is:

spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.driver.extraClassPath /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
spark.executor.extraClassPath /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar

I have also placed libdisni.so in /usr/lib.
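
A quick sanity check (a sketch, assuming the exact paths above): verify on every node that both the jar and the native library are present and readable, for example with

ls -l /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar   # the SparkRDMA jar
ls -l /usr/lib/libdisni.so                                                      # the DiSNI native library
ldconfig -p | grep -i disni    # should list libdisni.so if the linker cache sees it
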
When I run TeraGen, built from spark-terasort, it works with --master spark://master:7077 --deploy-mode client; the full command is

spark-submit  --master spark://master:7077 --deploy-mode client --class com.github.ehiggs.spark.terasort.TeraGen  /home/rui/software/spark-2.4.0-bin-hadoop2.7/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar  1g hdfs://master:9000/data/terasort_in1g

but it fails with a ClassNotFoundException when using --master spark://master:7077 --deploy-mode cluster; the full command is

spark-submit  --master spark://master:7077 --deploy-mode cluster --class com.github.ehiggs.spark.terasort.TeraGen  /home/rui/software/spark-2.4.0-bin-hadoop2.7/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar  1g hdfs://master:9000/data/terasort_in1g

The error output is:

Launch Command: "/home/rui/software/jdk1.8.0_212/bin/java" "-cp" "/home/rui/software/spark-2.4.0-bin-hadoop2.7/conf/:/home/rui/software/spark-2.4.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.extraClassPath=/home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar" "-Dspark.driver.supervise=false" "-Dspark.submit.deployMode=cluster" "-Dspark.master=spark://master:7077" "-Dspark.driver.extraClassPath=/home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar" "-Dspark.jars=file:/home/rui/software/spark-2.4.0-bin-hadoop2.7/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar" "-Dspark.rpc.askTimeout=10s" "-Dspark.app.name=com.github.ehiggs.spark.terasort.TeraGen" "-Dspark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@192.168.2.204:43489" "/home/rui/software/spark-2.4.0-bin-hadoop2.7/work/driver-20190930101138-0006/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar" "com.github.ehiggs.spark.terasort.TeraGen" "5g" "hdfs://master:9000/data/terasort_in5g2"
========================================

Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
    at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.rdma.RdmaShuffleManager
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
    at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:259)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:323)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
    at com.github.ehiggs.spark.terasort.TeraGen$.main(TeraGen.scala:48)
    at com.github.ehiggs.spark.terasort.TeraGen.main(TeraGen.scala)
    ... 6 more

How should I fix it? What's more, in client deploy mode, when I generate 50 GB of data with TeraGen using stock Spark (no spark-defaults.conf), the transfer speed between master and slave is about 270 MB/s. However, when I change my spark-defaults.conf to set spark.shuffle.manager to org.apache.spark.shuffle.rdma.RdmaShuffleManager, the speed is still 270 MB/s. Is this because I am using HDFS storage, which has nothing to do with the Spark shuffle? Can you recommend a workload that would significantly improve completion time when using SparkRDMA? Thanks a lot!
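
A hedged note on the failure itself: in the launch command above, the driver JVM's -cp contains only conf/ and jars/*; spark.driver.extraClassPath appears only as a -D system property, so the RDMA jar never reaches the cluster-mode driver's classpath. A minimal workaround sketch, assuming the host names and Spark layout quoted in this thread, is to copy the jar into Spark's jars/ directory on every node so it is always on the default classpath:

# Sketch only: host names and paths are the ones quoted in this issue
for host in master slave1; do
  scp /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar \
      "$host":/home/rui/software/spark-2.4.0-bin-hadoop2.7/jars/
done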

petro-rudenko commented 5 years ago

Hi, thanks for using SparkRDMA.

  1. Make sure /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar is accessible from both the master and the executors. You can run something like: ./spark/sbin/slaves.sh ls -al /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar

  2. Can you please describe your cluster (how many nodes, how many CPUs, and which NIC you are using)?

Thanks, Peter

UntaggedRui commented 5 years ago

Thanks for replying to my question.

  1. Yes, I'm sure. The result is:
    [rui@rdma-server-204 spark-2.4.0-bin-hadoop2.7]$ sbin/slaves.sh ls -al /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
    master: -rwxr-xr-x 1 rui rui 478528 Nov 29  2018 /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
    slave1: -rwxr-xr-x 1 rui rui 478528 Nov 29  2018 /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
  2. There are two servers in my cluster. One server is both the master and a worker, and the other is a worker only. The master has 2*24 cores and the worker has 2*12 cores (the detailed information is in the attached cluster info screenshot). My master node's NIC is:
    [rui@rdma-server-204 spark-2.4.0-bin-hadoop2.7]$ lspci | grep Mell
    09:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    09:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

    My worker node's NIC is:

    [rui@supervisor-1 ~]$ lspci | grep Mell
    05:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
    05:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
    06:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    06:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
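
A hedged aside: lspci reports these ConnectX-5 ports as Ethernet controllers, so RDMA here would run as RoCE. The standard rdma-core tools can confirm that the devices are visible and their ports are up, for example:

ibv_devices    # short list of RDMA devices (expect mlx5_* entries)
ibv_devinfo    # per-device details; port state should be PORT_ACTIVE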

petro-rudenko commented 5 years ago

So, --deploy-mode cluster is for the YARN deployment mode only; you are using standalone mode, so this parameter is not needed. Also, 2 nodes may be too few to see the full benefit of RDMA. You could first try running either the DiSNI RdmaVsTCPBenchmark or the UCX bandwidth benchmark; a raw-bandwidth sketch follows below.
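
For the raw-bandwidth check mentioned above, a minimal sketch with the standard perftest tools (independent of Spark; "master" is the host name used in this thread):

# On the server node, start the listener:
ib_write_bw
# On the client node, point at the server:
ib_write_bw master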

BTW, we're in the process of releasing Spark over UCX. It will have all the functionality of SparkRDMA, better performance, and support for other transports (CUDA, shared memory) and protocols. Keep in touch: https://github.com/openucx/sparkucx/

UntaggedRui commented 5 years ago

Oh, thank you very much.