UntaggedRui opened this issue 5 years ago
Hi, thanks for using SparkRDMA.
- Make sure /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar is accessible from both the master and the executors. You can run something like: ./spark/sbin/slaves.sh ls -al /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
- Can you please describe your cluster (how many nodes, how many CPU cores, and which NIC are you using)?
Thanks, Peter
Thanks for replying to my question.
[rui@rdma-server-204 spark-2.4.0-bin-hadoop2.7]$ sbin/slaves.sh ls -al /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
master: -rwxr-xr-x 1 rui rui 478528 Nov 29 2018 /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
slave1: -rwxr-xr-x 1 rui rui 478528 Nov 29 2018 /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
[rui@rdma-server-204 spark-2.4.0-bin-hadoop2.7]$ lspci | grep Mell
09:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
09:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
My worker node's NICs are:
[rui@supervisor-1 ~]$ lspci | grep Mell
05:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
05:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
06:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
06:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
So, --deploy-mode cluster is for YARN deployment mode only. You are using standalone mode, so this parameter is not needed. Also, two nodes may be too few to see the full benefit of RDMA. You could first try running either the DiSNI RdmaVsTCPBenchmark or the UCX bandwidth benchmark.
BTW, we're in the process of releasing Spark over UCX. It will have all the functionality of SparkRDMA, plus better performance and support for other transports (CUDA, shared memory) and protocols. Keep in touch: https://github.com/openucx/sparkucx/
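To sanity-check the RDMA link before involving Spark, the UCX bandwidth benchmark can be run with ucx_perftest (shipped with UCX). A minimal sketch, assuming UCX is installed on both nodes and the hostnames below are placeholders for your own machines:

```shell
# On the server node (e.g. the master), start the perftest listener:
ucx_perftest -t tag_bw

# On the client node (e.g. slave1), connect to the server and measure
# tag-matching bandwidth with 1 MiB messages over 1000 iterations:
ucx_perftest master -t tag_bw -s 1048576 -n 1000
```

If the reported bandwidth is close to line rate for your ConnectX-5 NICs, the RDMA fabric itself is healthy and any remaining bottleneck is elsewhere (storage, shuffle volume, or the workload).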
So,
--deploy-mode cluster
is for yarn deployment mode only. You are using standalone mode. So this parameter is not needed. Also, 2 nodes could be quite small to get the full benefit of RDMA. You could try first to run whether disni RdmaVsTCPBenchmark or UCX bandwith benchmarkBTW we're in progress of releasing Spark over UCX. It'll have all the functionality of SparkRDMA + better performance + support other transports (cuda, shared memory) and protocols. Keep in touch: https://github.com/openucx/sparkucx/
Oh, thank you very much.
The spark-defaults.conf of both the workers and the master is
and I have placed libdisni.so in /usr/lib. When I run TeraGen built from spark-terasort, I can run it with --master spark://master:7077 --deploy-mode client; the full command is
but it fails with a ClassNotFoundException when using --master spark://master:7077 --deploy-mode cluster; the full command is
The error info is
How should I fix it? What's more, in client deploy mode, when I generate 50 GB of data using TeraGen with raw Spark and no spark-defaults.conf, the transfer speed between master and slave is about 270 MB/s. However, when I change my spark-defaults.conf and set spark.shuffle.manager to org.apache.spark.shuffle.rdma.RdmaShuffleManager, the speed is still 270 MB/s. Is this because I used HDFS storage, which has nothing to do with the Spark shuffle? Can you recommend a workload that would significantly improve completion time when using SparkRDMA? Thanks a lot!
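For reference, the SparkRDMA README describes a configuration along these lines; the jar path below matches the location mentioned earlier in this thread, so adjust it to your own setup:

```properties
# spark-defaults.conf (sketch, per the SparkRDMA README)
spark.driver.extraClassPath   /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
spark.executor.extraClassPath /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
spark.shuffle.manager         org.apache.spark.shuffle.rdma.RdmaShuffleManager
```

Note that the RdmaShuffleManager only accelerates shuffle traffic between executors; a job such as TeraGen, which mostly writes output to HDFS with little or no shuffle, would not be expected to speed up.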