Mellanox / SparkRDMA

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx
Apache License 2.0

SparkRDMA issue: ERROR scheduler.TaskSetManager: Task 45 in stage 1.0 failed 4 times; aborting job #16

Closed: XianlaiShen closed this issue 5 years ago

XianlaiShen commented 5 years ago

Hi, I have attached the log for your reference. Please help me check it, thanks!

Because my servers are ARMv8 64-bit (Qualcomm ARM servers), I downloaded MLNX_OFED_LINUX-4.4-2.0.7.0-rhel7.5alternate-aarch64.tgz, compiled and installed it, and used perftest to confirm that the physical RDMA link works. I set up two nodes with Hadoop (hadoop-2.7.1) and Spark (spark-2.2.0-bin-hadoop2.7) on the ARMv8 64-bit servers, and used the HiBench-7.0 terasort case to validate SparkRDMA. I first ran it in Spark-on-YARN mode (Dynamic Resource Allocation), but it failed. Since you have run it successfully in Spark standalone mode, I switched to standalone mode. I built libdisni version 1.7 (git checkout tags/v1.7 -b v1.7) and configured SparkRDMA as below:

spark.driver.extraClassPath /home/xianlai/SparkRDMA/SparkRDMA-3.0/target/spark-rdma-3.0-for-spark-2.2.0-jar-with-dependencies.jar
spark.executor.extraClassPath /home/xianlai/SparkRDMA/SparkRDMA-3.0/target/spark-rdma-3.0-for-spark-2.2.0-jar-with-dependencies.jar
spark.driver.extraJavaOptions -Djava.library.path=/usr/local/lib/
spark.executor.extraJavaOptions -Djava.library.path=/usr/local/lib/
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.shuffle.compress false
spark.shuffle.spill.compress false
spark.broadcast.compress false
spark.broadcast.checksum false
spark.locality.wait 0

The issue log is:

18/11/30 15:03:26 INFO scheduler.TaskSetManager: Lost task 45.3 in stage 1.0 (TID 241) on 192.168.5.136, executor 3: java.lang.reflect.InvocationTargetException (null) [duplicate 175]
18/11/30 15:03:26 ERROR scheduler.TaskSetManager: Task 45 in stage 1.0 failed 4 times; aborting job
18/11/30 15:03:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 1
18/11/30 15:03:26 INFO scheduler.TaskSchedulerImpl: Stage 1 was cancelled
18/11/30 15:03:26 INFO scheduler.DAGScheduler: ShuffleMapStage 1 (map at ScalaTeraSort.scala:49) failed in 5.189 s
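One environment detail worth double-checking on ARMv8 before digging into the trace (this is my own assumption, not something confirmed in this thread): the kernel page size. Many aarch64 distributions run with 64KB pages, while code developed on x86_64 often implicitly assumes the usual 4KB, and mmap-based code fails with EINVAL ("Invalid argument") when an offset is not a multiple of the real page size:

```shell
# Print the kernel page size as seen by userspace.
# 4096 on typical x86_64 kernels; often 65536 on aarch64 (e.g. RHEL alternate).
getconf PAGESIZE
```

If this prints 65536 on your servers, any file offset that is only 4KB-aligned is not a valid mmap offset on that kernel.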
due to Job aborted due to stage failure: Task 45 in stage 1.0 failed 4 times, most recent failure: Lost task 45.3 in stage 1.0 (TID 241, 192.168.5.136, executor 3): java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.shuffle.rdma.RdmaMappedFile.mapAndRegister(RdmaMappedFile.java:151)
    at org.apache.spark.shuffle.rdma.RdmaMappedFile.mapAndRegister(RdmaMappedFile.java:110)
    at org.apache.spark.shuffle.rdma.RdmaMappedFile.<init>(RdmaMappedFile.java:88)
    at org.apache.spark.shuffle.rdma.writer.wrapper.RdmaWrapperShuffleData.writeIndexFileAndCommit(RdmaWrapperShuffleWriter.scala:65)
    at org.apache.spark.shuffle.rdma.RdmaShuffleBlockResolver.writeIndexFileAndCommit(RdmaShuffleBlockResolver.scala:64)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:224)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:169)
    at org.apache.spark.shuffle.rdma.writer.wrapper.RdmaWrapperShuffleWriter.write(RdmaWrapperShuffleWriter.scala:102)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Invalid argument
    at sun.nio.ch.FileChannelImpl.map0(Native Method)
    ... 18 more

Driver stacktrace:
18/11/30 15:03:26 INFO scheduler.TaskSetManager: Lost task 19.3 in stage 1.0 (TID 242) on 192.168.5.136, executor 3: java.lang.reflect.InvocationTargetException (null) [duplicate 176]
18/11/30 15:03:26 INFO scheduler.TaskSetManager: Lost task 17.0 in stage 1.0 (TID 63) on 192.168.5.136, executor 2: java.lang.reflect.InvocationTargetException (null) [duplicate 177]
18/11/30 15:03:26 INFO scheduler.TaskSetManager: Lost task 8.2 in stage 1.0 (TID 236) on 192.168.5.136, executor 4: java.lang.reflect.InvocationTargetException (null) [duplicate 178]
18/11/30 15:03:26 INFO scheduler.DAGScheduler: Job 1 failed: runJob at SparkHadoopMapReduceWriter.scala:88, took 5.472082 s
18/11/30 15:03:26 INFO scheduler.TaskSetManager: Lost task 89.0 in stage 1.0 (TID 135) on 192.168.5.136, executor 2: java.lang.reflect.InvocationTargetException (null) [duplicate 179]
18/11/30 15:03:26 ERROR io.SparkHadoopMapReduceWriter: Aborting job job_20181130150320_0006.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 45 in stage 1.0 failed 4 times, most recent failure: Lost task 45.3 in stage 1.0 (TID 241, 192.168.5.136, executor 3): java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.shuffle.rdma.RdmaMappedFile.mapAndRegister(RdmaMappedFile.java:151)
    at org.apache.spark.shuffle.rdma.RdmaMappedFile.mapAndRegister(RdmaMappedFile.java:110)
    at org.apache.spark.shuffle.rdma.RdmaMappedFile.<init>(RdmaMappedFile.java:88)
    at org.apache.spark.shuffle.rdma.writer.wrapper.RdmaWrapperShuffleData.writeIndexFileAndCommit(RdmaWrapperShuffleWriter.scala:65)
    at org.apache.spark.shuffle.rdma.RdmaShuffleBlockResolver.writeIndexFileAndCommit(RdmaShuffleBlockResolver.scala:64)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:224)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:169)
    at org.apache.spark.shuffle.rdma.writer.wrapper.RdmaWrapperShuffleWriter.write(RdmaWrapperShuffleWriter.scala:102)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Invalid argument
    at sun.nio.ch.FileChannelImpl.map0(Native Method)
    ... 18 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
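The root failure above, java.io.IOException: Invalid argument thrown from the native sun.nio.ch.FileChannelImpl.map0, is an EINVAL returned by the underlying mmap(2) call. One plausible trigger on ARMv8 (my assumption for illustration; the thread does not confirm it) is a mapping offset that is aligned to 4096 bytes, as code written for x86_64 often assumes, but not to the kernel's actual page size, which is frequently 65536 on aarch64. A minimal sketch of that alignment condition, using the JDK-internal sun.misc.Unsafe.pageSize() available on the JDK 8 this trace comes from (the class and method names here are hypothetical helpers, not part of SparkRDMA):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class PageAlignCheck {
    // mmap(2) requires the file offset to be a multiple of the kernel page
    // size; an offset that is only 4KB-aligned fails with EINVAL on a
    // 64KB-page kernel, which surfaces in Java as IOException: Invalid argument.
    static boolean isPageAligned(long offset, int pageSize) {
        return offset % pageSize == 0;
    }

    public static void main(String[] args) throws Exception {
        // Obtain the page size the JVM sees, via the internal Unsafe singleton.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        int pageSize = ((Unsafe) f.get(null)).pageSize();

        System.out.println("kernel page size seen by the JVM: " + pageSize);
        // 4096 is a valid mmap offset on 4KB-page kernels (x86_64),
        // but not on a 64KB-page aarch64 kernel:
        System.out.println("offset 4096 page-aligned here: "
                + isPageAligned(4096, pageSize));
    }
}
```

If the check fails for the offsets a mapping helper produces, rounding them down to a multiple of the real page size (and compensating in the mapped length) is the usual remedy.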