Mellanox / SparkRDMA

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx
Apache License 2.0

java.lang.NoClassDefFoundError: Could not initialize class com.ibm.disni.rdma.verbs.impl.NativeDispatcher #27

Akshay-Venkatesh closed 5 years ago

Akshay-Venkatesh commented 5 years ago

I'm seeing the error below when running Spark on 2 nodes (1 master and 2 workers). I'm not a frequent Java user, but any thoughts on why I'd be seeing an initialization error here?

2019-02-27 22:37:36 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
2019-02-27 22:38:10 WARN  TaskSetManager:66 - Lost task 11.0 in stage 0.0 (TID 11, 10.31.229.69, executor 0): java.lang.NoClassDefFoundError: Could not initialize class com.ibm.disni.rdma.verbs.impl.NativeDispatcher
        at com.ibm.disni.rdma.verbs.impl.RdmaProviderNat.<init>(RdmaProviderNat.java:43)
        at com.ibm.disni.rdma.verbs.RdmaProvider.provider(RdmaProvider.java:58)
        at com.ibm.disni.rdma.verbs.RdmaCm.open(RdmaCm.java:49)
        at com.ibm.disni.rdma.verbs.RdmaEventChannel.createEventChannel(RdmaEventChannel.java:66)
        at org.apache.spark.shuffle.rdma.RdmaNode.<init>(RdmaNode.java:64)
        at org.apache.spark.shuffle.rdma.RdmaShuffleManager.startRdmaNodeIfMissing(RdmaShuffleManager.scala:193)
        at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getWriter(RdmaShuffleManager.scala:266)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:98)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

2019-02-27 22:38:10 WARN  TaskSetManager:66 - Lost task 5.0 in stage 0.0 (TID 5, 10.31.229.69, executor 0): java.lang.UnsatisfiedLinkError: no disni in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        at java.lang.System.loadLibrary(System.java:1122)
        at com.ibm.disni.rdma.verbs.impl.NativeDispatcher.<clinit>(NativeDispatcher.java:36)
        at com.ibm.disni.rdma.verbs.impl.RdmaProviderNat.<init>(RdmaProviderNat.java:43)
        at com.ibm.disni.rdma.verbs.RdmaProvider.provider(RdmaProvider.java:58)
        at com.ibm.disni.rdma.verbs.RdmaCm.open(RdmaCm.java:49)
        at com.ibm.disni.rdma.verbs.RdmaEventChannel.createEventChannel(RdmaEventChannel.java:66)
        at org.apache.spark.shuffle.rdma.RdmaNode.<init>(RdmaNode.java:64)
        at org.apache.spark.shuffle.rdma.RdmaShuffleManager.startRdmaNodeIfMissing(RdmaShuffleManager.scala:193)
        at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getWriter(RdmaShuffleManager.scala:266)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:98)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
        ...

        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class com.ibm.disni.rdma.verbs.impl.NativeDispatcher
        at com.ibm.disni.rdma.verbs.impl.RdmaProviderNat.<init>(RdmaProviderNat.java:43)
        at com.ibm.disni.rdma.verbs.RdmaProvider.provider(RdmaProvider.java:58)
        at com.ibm.disni.rdma.verbs.RdmaCm.open(RdmaCm.java:49)
        at com.ibm.disni.rdma.verbs.RdmaEventChannel.createEventChannel(RdmaEventChannel.java:66)
        at org.apache.spark.shuffle.rdma.RdmaNode.<init>(RdmaNode.java:64)
        at org.apache.spark.shuffle.rdma.RdmaShuffleManager.startRdmaNodeIfMissing(RdmaShuffleManager.scala:193)
        at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getWriter(RdmaShuffleManager.scala:266)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:98)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
petro-rudenko commented 5 years ago

You need to put libdisni.so from the release tarball into some directory on all Spark executors. If this directory is not in java.library.path on every Spark master and worker (usually in /usr/lib), then you need to add this Spark configuration:

spark.executor.extraJavaOptions -Djava.library.path=/hpc/scrap/users/swat/jenkins/disni/
spark.driver.extraJavaOptions   -Djava.library.path=/hpc/scrap/users/swat/jenkins/disni/ 
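
To confirm that a node can actually see the library before rerunning the job, a small standalone check can help. The following is a minimal sketch, not part of SparkRDMA or DiSNI: the class name `DisniLibraryCheck` and the helper `findLibrary` are hypothetical, and it only scans the directories listed in `java.library.path` for the file name that `System.mapLibraryName("disni")` produces (`libdisni.so` on Linux). The JVM may consult other locations as well, so treat a miss as a hint rather than proof.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper for illustration; not part of SparkRDMA or DiSNI.
public class DisniLibraryCheck {

    /** Returns the first matching native library file in the search path, if any. */
    static Optional<Path> findLibrary(String libName, String searchPath) {
        // Platform-specific file name, e.g. "libdisni.so" on Linux.
        String fileName = System.mapLibraryName(libName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Reads the same property that the -Djava.library.path option sets.
        String searchPath = System.getProperty("java.library.path", "");
        Optional<Path> hit = findLibrary("disni", searchPath);
        if (hit.isPresent()) {
            System.out.println("found: " + hit.get());
        } else {
            System.out.println("no disni library found in java.library.path=" + searchPath);
        }
    }
}
```

Running it on each worker with the same `-Djava.library.path=...` value that is passed through `spark.executor.extraJavaOptions` should show whether the directory is visible on that node.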
Akshay-Venkatesh commented 5 years ago

Thanks a lot! Your latter suggestion helped.

Akshay-Venkatesh commented 5 years ago

Sorry, reopening this because there may be a related issue. I see this at the end of the run:

2019-02-28 15:25:49 ERROR RdmaNode:384 - Failed to stop RdmaChannel during 50 ms
petro-rudenko commented 5 years ago

This is OK. To save time at the end of the job, it forcefully stops the RDMA channel.