Mellanox / SparkRDMA

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx
Apache License 2.0

Need memory overhead to run Spark RDMA shuffler #30

Closed tobegit3hub closed 5 years ago

tobegit3hub commented 5 years ago

We use SparkRDMA by adding libdisni.so and spark-rdma-3.1-for-spark-2.3.0-jar-with-dependencies.jar to the spark-submit command. But some tasks always fail with java.io.IOException: getCmEvent() failed. Here is the complete log.

java.io.IOException: getCmEvent() failed
    at org.apache.spark.shuffle.rdma.RdmaChannel.processRdmaCmEvent(RdmaChannel.java:348)
    at org.apache.spark.shuffle.rdma.RdmaChannel.connect(RdmaChannel.java:309)
    at org.apache.spark.shuffle.rdma.RdmaNode.getRdmaChannel(RdmaNode.java:308)
    at org.apache.spark.shuffle.rdma.RdmaShuffleManager.org$apache$spark$shuffle$rdma$RdmaShuffleManager$$getRdmaChannel(RdmaShuffleManager.scala:314)
    at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getRdmaChannelToDriver(RdmaShuffleManager.scala:322)
    at org.apache.spark.shuffle.rdma.RdmaShuffleManager$$anon$7.apply(RdmaShuffleManager.scala:349)
    at org.apache.spark.shuffle.rdma.RdmaShuffleManager$$anon$7.apply(RdmaShuffleManager.scala:343)
    at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
    at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getMapTaskOutputTable(RdmaShuffleManager.scala:342)
    at org.apache.spark.shuffle.rdma.RdmaShuffleFetcherIterator.startAsyncRemoteFetches(RdmaShuffleFetcherIterator.scala:185)
    at org.apache.spark.shuffle.rdma.RdmaShuffleFetcherIterator.initialize(RdmaShuffleFetcherIterator.scala:325)
    at org.apache.spark.shuffle.rdma.RdmaShuffleFetcherIterator.<init>(RdmaShuffleFetcherIterator.scala:85)
    at org.apache.spark.shuffle.rdma.RdmaShuffleReader.read(RdmaShuffleReader.scala:44)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

It seems SparkRDMA requires more memory to run, and the tasks may succeed if we add --conf spark.executor.memoryOverhead=10g. We are confused about this: how much memory overhead should we add for this job?
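For context, a submit command along these lines is what we are describing. The master, library paths, and application file are placeholders for illustration, not the exact values from our cluster:

```shell
# Illustrative sketch only: paths, sizes, and app.py are assumed values.
# libdisni.so is expected to be on the library path of driver and executors.
spark-submit \
  --master yarn \
  --conf spark.driver.extraLibraryPath=/usr/local/lib \
  --conf spark.executor.extraLibraryPath=/usr/local/lib \
  --jars spark-rdma-3.1-for-spark-2.3.0-jar-with-dependencies.jar \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager \
  --conf spark.executor.memoryOverhead=10g \
  app.py
```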

petro-rudenko commented 5 years ago

SparkRDMA uses native off-heap memory, so yes, you need to set the memoryOverhead parameter. The right size of memoryOverhead depends on the total dataset size, the number of executors, and the cores per executor.

tobegit3hub commented 5 years ago

Thanks @petro-rudenko. Do you have any guidance on setting memoryOverhead? For example, if the shuffle read or shuffle write is 10 GB per executor, do we need about 10 GB of memoryOverhead for SparkRDMA?

petro-rudenko commented 5 years ago

@tobegit3hub you would need 10 GB + (cores per executor) × spark.shuffle.rdma.MaxBytesInFlight. E.g. 10 GB + 20 cores × 500 MB = ~20 GB.
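The rule of thumb above can be written out as a small calculation. The function name and variable names here are illustrative, not part of SparkRDMA itself:

```python
# Sizing rule from the comment above:
#   memoryOverhead ≈ shuffle size per executor
#                    + cores per executor × spark.shuffle.rdma.MaxBytesInFlight
def rdma_memory_overhead_bytes(shuffle_bytes, cores_per_executor, max_bytes_in_flight):
    """Estimate the off-heap memory overhead SparkRDMA needs per executor."""
    return shuffle_bytes + cores_per_executor * max_bytes_in_flight

GB = 1024 ** 3
MB = 1024 ** 2

# The example from this thread: 10 GB shuffle, 20 cores, 500 MB in flight.
overhead = rdma_memory_overhead_bytes(10 * GB, 20, 500 * MB)
print(overhead / GB)  # → 19.765625, i.e. roughly 20 GB
```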

tobegit3hub commented 5 years ago

Thanks for the detailed explanation. That helps a lot, @petro-rudenko!