Mellanox / SparkRDMA

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx
Apache License 2.0

[SparkRDMA] spark-terasort job hangs at last stage when using spark-rdma plugin #6

Closed: li7hui closed this issue 6 years ago

li7hui commented 6 years ago

Hi,

I was profiling spark-terasort jobs with SparkRDMA. The spark-terasort code is downloaded from https://github.com/fengshenwu/spark-terasort and runs successfully with Spark 2.2.0 on 4 compute nodes. However, when using the SparkRDMA plugin, the job appears to complete but hangs at the last stage and never releases the RDMA buffers. Please see the attached screenshots ("job hangs" and "no job running").

yuvaldeg commented 6 years ago

Hi Hui, Thanks for reporting back, and nice work getting everything to run. We are aware of this issue and already have a fix for it, which will be published soon. For now, you can work around the issue by adding "sc.stop()" or "System.exit(0)" to the end of the test code; this explicitly shuts down the application. The problem is that the SparkRDMA threads were created as non-daemon threads, which keep the JVM running when the application finishes without exiting explicitly. In the upcoming fix, we have changed those threads to daemon threads.
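To illustrate why the daemon flag matters, here is a minimal, hypothetical Java sketch (not SparkRDMA's actual code; the class `DaemonThreads` and method `newWorker` are made up for illustration). The JVM exits only when all non-daemon threads have finished, so a background thread that loops forever keeps the process alive unless it is marked as a daemon or the program calls `System.exit(0)`.

```java
public class DaemonThreads {
    // Hypothetical helper: creates a background worker that loops forever,
    // standing in for a long-lived helper thread such as a polling thread.
    static Thread newWorker(boolean daemon) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(1000); // simulate waiting for work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore flag and exit
            }
        }, "background-worker");
        t.setDaemon(daemon); // must be set before start()
        return t;
    }

    public static void main(String[] args) {
        Thread worker = newWorker(true);
        worker.start();
        System.out.println("main() returning; daemon=" + worker.isDaemon());
        // Because the worker is a daemon thread, the JVM exits after main()
        // returns. With newWorker(false), the JVM would keep running because
        // the worker never finishes, unless the program called System.exit(0).
    }
}
```

Calling `newWorker(false)` in `main` reproduces the "hang after the job completes" behavior described in the report, while `newWorker(true)` corresponds to the daemon-thread fix.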

Please let me know if you are still having any issues.

Thanks, Yuval.

li7hui commented 6 years ago

Hi Yuval, After adding System.exit(0) to the Scala code, the error is gone.

I still have one question about the Terasort results with SparkRDMA. We ran several Terasort experiments with SparkRDMA on 4 compute nodes with 16 GB of input data. We found that the run without RDMA (3.6 minutes) was actually faster than the run with RDMA (5 minutes). Both types of experiments used the RoCE network. Could you share your Terasort experiment settings for both vanilla Spark and RDMA?

Thanks & Regards,

yuvaldeg commented 6 years ago

Hi Hui, Glad to hear that resolved the error. Regarding the performance results, the longer runtime with RDMA looks like a very common issue in which the fabric is not configured correctly for flow control, which is essential for RoCE to perform well. Our tuning guide, which also includes steps for verifying that flow control is configured correctly, is here: https://github.com/Mellanox/SparkRDMA/wiki/Performance-Tuning-for-Mellanox-Adapters I suggest that you give it a try and report back; I will provide more performance tips if needed.

Thanks, Yuval.