Closed li7hui closed 6 years ago
Hi, the mismatch in size usually doesn't cause failures, so I think the issue is with binding to the right RDMA device. Is 172.31.101.104 your RDMA device's IP address?
Hi, 172.31.101.104 is the internet-facing IP address; 10.10.10.104 is the intranet IP address on the RoCE network, and it is already bound to mlx_bond_0. How do I let SparkRDMA know about this configuration? If I modify /etc/hosts to point to the 10.10.10.104 network, Spark will not work...
Since your hostname points to 172.31.101.104, Spark will bind to 172.31.101.104 by default, and SparkRDMA will follow.
One option to overcome this issue without changing your hosts file is by adding these lines to your spark-env.sh:
export SPARK_MASTER_HOST=<THE RDMA IP ADDRESS OF YOUR MASTER NODE>
export SPARK_LOCAL_IP=$(/usr/sbin/ip addr show <THE NETWORK DEVICE NAME OF YOUR RDMA DEVICE> | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)
For example, on my system the master node's RDMA IP address is "192.168.1.12" and the RDMA network device name (as it appears in ifconfig) is "ens2", so this is how these lines look on my setup:
export SPARK_MASTER_HOST=192.168.1.12
export SPARK_LOCAL_IP=$(/usr/sbin/ip addr show ens2 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)
The above assumes you are running in standalone mode. Let me know if you are running in a different mode.
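For anyone adapting this, here is a slightly more defensive sketch of the same idea for spark-env.sh. It is only an illustration, not part of SparkRDMA: the helper names extract_ipv4 and rdma_ip and the RDMA_IFACE variable are made up for this example, and "ens2" is simply the device name from the setup described above. It leaves SPARK_LOCAL_IP unset when the interface has no IPv4 address, rather than exporting an empty value.

```shell
# Print the first IPv4 address found in `ip addr show` output read from stdin.
# Lines starting with "inet6" are ignored because only "inet" is matched.
extract_ipv4() {
    awk '$1 == "inet" { sub(/\/.*/, "", $2); print $2; exit }'
}

# Hypothetical helper: resolve the IPv4 address of the named network device.
rdma_ip() {
    /usr/sbin/ip addr show "$1" 2>/dev/null | extract_ipv4
}

# Assumption: "ens2" is the RDMA device name; override RDMA_IFACE for your system.
RDMA_IFACE="${RDMA_IFACE:-ens2}"
RDMA_IP="$(rdma_ip "$RDMA_IFACE" || true)"

if [ -n "$RDMA_IP" ]; then
    export SPARK_LOCAL_IP="$RDMA_IP"
else
    # Do not export an empty value; let Spark fall back to its default binding.
    echo "warning: no IPv4 address found on $RDMA_IFACE; SPARK_LOCAL_IP not set" >&2
fi
```

Like the lines above, this only covers standalone mode, where each node sources its own spark-env.sh.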
@yuvaldeg Hi, this is really helpful! I am testing this now.
@yuvaldeg Good news: after setting SPARK_LOCAL_IP, I can now run TeraSort with SparkRDMA successfully. You can close this issue. Many thanks.
Happy to help! Please let us know if you stumble upon any other issues
SPARK_LOCAL_IP seems to work only in standalone mode.
How can we set the IP for workers if we submit Spark jobs in yarn-cluster mode? @yuvaldeg
Hello,
I am testing SparkRDMA with a Mellanox ConnectX-4 Lx card. I installed Spark 2.2.0 and downloaded the SparkTeraSort sample code. The sample code runs successfully with plain Spark 2.2.0; however, when I run the TeraSort code with the SparkRDMA plugin, it throws the error shown in the following picture. Do I need to upgrade libibverbs.so, or do I need to configure the RDMA network for Spark? Please help.