Hi,
I have attached the log for your reference. Please help me check it, thanks!
My servers are ARMv8 64-bit (Qualcomm ARM servers), so I downloaded MLNX_OFED_LINUX-4.4-2.0.7.0-rhel7.5alternate-aarch64.tgz, compiled and installed it, and ran perftest to confirm that the physical RDMA link works.
I have set up two nodes with Hadoop (hadoop-2.7.1) and Spark (spark-2.2.0-bin-hadoop2.7, standalone) on these servers, and used the HiBench-7.0 TeraSort case to validate SparkRDMA. I first ran it with Spark on YARN (Dynamic Resource Allocation), but it failed. Since you report that it runs successfully with Spark in standalone mode, I switched to standalone mode.
I built libdisni version 1.7 (git checkout tags/v1.7 -b v1.7) and configured SparkRDMA as below:
spark.driver.extraClassPath /home/xianlai/SparkRDMA/SparkRDMA-3.0/target/spark-rdma-3.0-for-spark-2.2.0-jar-with-dependencies.jar
spark.executor.extraClassPath /home/xianlai/SparkRDMA/SparkRDMA-3.0/target/spark-rdma-3.0-for-spark-2.2.0-jar-with-dependencies.jar
spark.driver.extraJavaOptions -Djava.library.path=/usr/local/lib/
spark.executor.extraJavaOptions -Djava.library.path=/usr/local/lib/
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.shuffle.compress false
spark.shuffle.spill.compress false
spark.broadcast.compress false
spark.broadcast.checksum false
spark.locality.wait 0
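Before digging into the shuffle failure, one quick sanity check I run is whether `-Djava.library.path=/usr/local/lib/` actually lets the JVM resolve libdisni at runtime. This is a small probe of my own (not SparkRDMA code); the class name is mine:

```java
// Quick check that the JVM can resolve libdisni.so from java.library.path.
// Run with: java -Djava.library.path=/usr/local/lib/ DisniProbe
public class DisniProbe {
    public static void main(String[] args) {
        try {
            // Looks for libdisni.so on java.library.path, same as SparkRDMA does.
            System.loadLibrary("disni");
            System.out.println("libdisni loaded");
        } catch (UnsatisfiedLinkError e) {
            System.out.println("libdisni NOT found: " + e.getMessage());
        }
    }
}
```

If this prints "NOT found", the executors would fail much earlier than the shuffle write, so in my case it loads fine and the problem is elsewhere.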
The issue log is:
18/11/30 15:03:26 INFO scheduler.TaskSetManager: Lost task 45.3 in stage 1.0 (TID 241) on 192.168.5.136, executor 3: java.lang.reflect.InvocationTargetException (null) [duplicate 175]
18/11/30 15:03:26 ERROR scheduler.TaskSetManager: Task 45 in stage 1.0 failed 4 times; aborting job
18/11/30 15:03:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 1
18/11/30 15:03:26 INFO scheduler.TaskSchedulerImpl: Stage 1 was cancelled
18/11/30 15:03:26 INFO scheduler.DAGScheduler: ShuffleMapStage 1 (map at ScalaTeraSort.scala:49) failed in 5.189 s due to Job aborted due to stage failure: Task 45 in stage 1.0 failed 4 times, most recent failure: Lost task 45.3 in stage 1.0 (TID 241, 192.168.5.136, executor 3): java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.shuffle.rdma.RdmaMappedFile.mapAndRegister(RdmaMappedFile.java:151)
at org.apache.spark.shuffle.rdma.RdmaMappedFile.mapAndRegister(RdmaMappedFile.java:110)
at org.apache.spark.shuffle.rdma.RdmaMappedFile.&lt;init&gt;(RdmaMappedFile.java:88)
at org.apache.spark.shuffle.rdma.writer.wrapper.RdmaWrapperShuffleData.writeIndexFileAndCommit(RdmaWrapperShuffleWriter.scala:65)
at org.apache.spark.shuffle.rdma.RdmaShuffleBlockResolver.writeIndexFileAndCommit(RdmaShuffleBlockResolver.scala:64)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:224)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:169)
at org.apache.spark.shuffle.rdma.writer.wrapper.RdmaWrapperShuffleWriter.write(RdmaWrapperShuffleWriter.scala:102)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Invalid argument
at sun.nio.ch.FileChannelImpl.map0(Native Method)
... 18 more
Driver stacktrace:
18/11/30 15:03:26 INFO scheduler.TaskSetManager: Lost task 19.3 in stage 1.0 (TID 242) on 192.168.5.136, executor 3: java.lang.reflect.InvocationTargetException (null) [duplicate 176]
18/11/30 15:03:26 INFO scheduler.TaskSetManager: Lost task 17.0 in stage 1.0 (TID 63) on 192.168.5.136, executor 2: java.lang.reflect.InvocationTargetException (null) [duplicate 177]
18/11/30 15:03:26 INFO scheduler.TaskSetManager: Lost task 8.2 in stage 1.0 (TID 236) on 192.168.5.136, executor 4: java.lang.reflect.InvocationTargetException (null) [duplicate 178]
18/11/30 15:03:26 INFO scheduler.DAGScheduler: Job 1 failed: runJob at SparkHadoopMapReduceWriter.scala:88, took 5.472082 s
18/11/30 15:03:26 INFO scheduler.TaskSetManager: Lost task 89.0 in stage 1.0 (TID 135) on 192.168.5.136, executor 2: java.lang.reflect.InvocationTargetException (null) [duplicate 179]
18/11/30 15:03:26 ERROR io.SparkHadoopMapReduceWriter: Aborting job job_20181130150320_0006.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 45 in stage 1.0 failed 4 times, most recent failure: Lost task 45.3 in stage 1.0 (TID 241, 192.168.5.136, executor 3): java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.shuffle.rdma.RdmaMappedFile.mapAndRegister(RdmaMappedFile.java:151)
at org.apache.spark.shuffle.rdma.RdmaMappedFile.mapAndRegister(RdmaMappedFile.java:110)
at org.apache.spark.shuffle.rdma.RdmaMappedFile.&lt;init&gt;(RdmaMappedFile.java:88)
at org.apache.spark.shuffle.rdma.writer.wrapper.RdmaWrapperShuffleData.writeIndexFileAndCommit(RdmaWrapperShuffleWriter.scala:65)
at org.apache.spark.shuffle.rdma.RdmaShuffleBlockResolver.writeIndexFileAndCommit(RdmaShuffleBlockResolver.scala:64)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:224)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:169)
at org.apache.spark.shuffle.rdma.writer.wrapper.RdmaWrapperShuffleWriter.write(RdmaWrapperShuffleWriter.scala:102)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Invalid argument
at sun.nio.ch.FileChannelImpl.map0(Native Method)
... 18 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
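The failure bottoms out in `FileChannelImpl.map0` with EINVAL ("Invalid argument"), i.e. the kernel rejected the mmap arguments before any RDMA code ran. Note that `RdmaMappedFile.mapAndRegister` reaches `map0` through reflection (that is what the `GeneratedMethodAccessor31.invoke` frame is), which bypasses the JDK's own offset-alignment logic; my guess (an assumption on my side, not confirmed) is a 4 KiB page-size assumption that breaks on an ARMv8 kernel with 64 KiB pages (`getconf PAGESIZE`). As a first isolation step, this standalone probe (my own sketch, not SparkRDMA code) exercises the public `FileChannel.map` path with the same call shape:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Minimal probe for the map0 EINVAL above. Plain FileChannel.map goes through
// the JDK's own alignment handling and should succeed even with 64 KiB pages;
// if it fails too, the problem is in the kernel/filesystem mmap path rather
// than in SparkRDMA's direct map0 invocation.
public class MapProbe {
    public static void main(String[] args) throws Exception {
        // Report the kernel page size the JVM sees (same value as `getconf PAGESIZE`).
        Field theUnsafe = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafe.setAccessible(true);
        sun.misc.Unsafe unsafe = (sun.misc.Unsafe) theUnsafe.get(null);
        System.out.println("page size: " + unsafe.pageSize());

        File f = File.createTempFile("mapprobe", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            raf.setLength(1 << 20);  // 1 MiB backing file
            // Same call shape as a shuffle-file mapping: READ_WRITE, nonzero offset.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 4096, 64 * 1024);
            buf.put(0, (byte) 42);
            System.out.println("map ok");
        }
    }
}
```

On my nodes this probe succeeds while the shuffle write fails, which is why I suspect the reflective `map0` call path inside `RdmaMappedFile` rather than the OFED stack.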