Mellanox / SparkRDMA

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx
Apache License 2.0

Getting lower RDMA perf than TCP/IP perf #21

Open gilbertc777 opened 5 years ago

gilbertc777 commented 5 years ago

Hi all,

I am running the HiBench suite Terasort workload against RDMA Spark. My current results show that TCP is performing better than RDMA. Any thoughts on configs to look into? Happy to post the config settings currently in place as well to help diagnose.

petro-rudenko commented 5 years ago

Hi, sure, would be glad to assist. Can you please describe your setup (number of nodes, hardware type, what workload you are running, the Spark/HiBench config you are using, etc.)?

gilbertc777 commented 5 years ago

Hi Petro,

I am using HiBench to run the Terasort and Wordcount benchmarks.

The setup is as follows:

10 nodes with 36 cores each @ 3 GHz and 384 GB RAM

100G RoCE over Mellanox CX4 cards

I will upload the configs in a few.

Thanks for the help!

gilbertc777 commented 5 years ago
SPARK_MASTER_HOST=hpc-cluster-cn1-rdma
SPARK_MASTER_IP=192.18.18.1
SPARK_LOCAL_DIRS=/localvolume/hadoop/spark/tmp
SPARK_WORKER_DIR=/localvolume/hadoop/spark/worker
SPARK_LOCAL_DIRS=/localvolume/hadoop/spark/local
RDMA_INTERFACE=mlx5_0
--

HiBench spark.conf

# Spark home
hibench.spark.home      $SPARK_HOME
# Spark master
#   standalone mode: spark://xxx:7077
#   YARN mode: yarn-client 
hibench.spark.master    spark://hpc-cluster-cn1-rdma:7077
#hibench.spark.master yarn
spark.shuffle.compress          false
spark.shuffle.spill.compress    false
spark.driver.extraClassPath /localvolume/nfs-share/hadoop-rdma/applications/mellanox/spark-2.4.0-bin-hadoop2.7/rdma/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
spark.executor.extraClassPath /localvolume/nfs-share/hadoop-rdma/applications/mellanox/spark-2.4.0-bin-hadoop2.7/rdma/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.roce.enabled false
spark.ib.enabled false
spark.shuffle.rdma.device.name mlx5_0
spark.shuffle.rdma.pconn.enabled true
spark.broadcast.compress false
spark.broadcast.checksum false
spark.locality.wait 0 
#spark.shuffle.rdma.chunk.size 2097152
#spark.shuffle.rdma.memory.slab.size 134217728
# executor number and cores when running on Yarn
hibench.yarn.executor.num     2
hibench.yarn.executor.cores   2
# executor and driver memory in standalone & YARN mode
spark.executor.memory  4g
spark.driver.memory    4g
--
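For reference, a minimal sketch in Scala of applying the same shuffle-manager settings programmatically (this is illustrative only: it assumes the spark-rdma jar is already on the driver and executor classpaths, the property names mirror the spark.conf above, and the shuffle-heavy job at the end is just a smoke test, not part of the original configs):

import org.apache.spark.sql.SparkSession

// Build a session that routes shuffle traffic through SparkRDMA,
// with shuffle compression disabled as in the spark.conf above.
val spark = SparkSession.builder()
  .appName("rdma-shuffle-smoke-test")
  .config("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
  .config("spark.shuffle.compress", "false")
  .config("spark.shuffle.spill.compress", "false")
  .config("spark.shuffle.rdma.device.name", "mlx5_0")
  .getOrCreate()

// Any wide transformation forces a shuffle and therefore exercises the RDMA path.
val distinctKeys = spark.sparkContext
  .parallelize(1 to 1000000, numSlices = 100)
  .map(i => (i % 1000, 1L))
  .reduceByKey(_ + _)
  .count()

println(s"distinct keys after shuffle: $distinctKeys")
spark.stop()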

[opc@hpc-cluster-cn1-rdma conf]$ cat hadoop.conf
# Hadoop home
hibench.hadoop.home    $HADOOP_HOME
# The path of hadoop executable
hibench.hadoop.executable     ${hibench.hadoop.home}/bin/hadoop
# Hadoop configuration directory
hibench.hadoop.configure.dir  ${hibench.hadoop.home}/etc/hadoop
# The root HDFS path to store HiBench data
hibench.hdfs.master       hdfs://hpc-cluster-cn1-rdma:9000
# Hadoop release provider. Supported value: apache, cdh5, hdp
hibench.hadoop.release    apache
--
gilbertc777 commented 5 years ago

Top of the hibench.conf

[opc@hpc-cluster-cn1-rdma conf]$ cat hibench.conf
# Data scale profile. Available value is tiny, small, large, huge, gigantic and bigdata.
# The definition of these profiles can be found in the workload's conf file i.e. conf/workloads/micro/wordcount.conf
hibench.scale.profile                bigdata 
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism         1000
# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism     15000
petro-rudenko commented 5 years ago

@gilbertc777 thanks, what results do you get for TCP and RDMA?

gilbertc777 commented 5 years ago

ScalaSparkTerasort 2019-02-12 23:58:52 32000000000 112.184 285245667 28524566

ScalaSparkTerasort 2019-02-13 00:02:02 32000000000 122.759 260673351 26067335

The first test is running TCP, the second is running with RDMA enabled. Note the RDMA runtime is higher.
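For context, these look like the standard HiBench report columns (input size in bytes, duration in seconds, throughput in bytes/s, throughput per node), so the last two values follow directly from size / duration on the 10-node cluster, and the RDMA run comes out roughly 9% slower end-to-end. A quick check in plain Scala (the variable names are mine, the numbers are from the report above):

// Reproducing the report's throughput columns from input size and duration.
val inputBytes  = 32000000000L   // 32,000,000,000 bytes of Terasort input (from the report)
val tcpSeconds  = 112.184        // TCP run
val rdmaSeconds = 122.759        // RDMA run

val tcpThroughput  = inputBytes / tcpSeconds    // ~285,245,667 bytes/s
val rdmaThroughput = inputBytes / rdmaSeconds   // ~260,673,351 bytes/s
val tcpPerNode     = tcpThroughput / 10         // ~28,524,566 bytes/s per node

println(f"TCP: $tcpThroughput%.0f B/s, RDMA: $rdmaThroughput%.0f B/s, slowdown ${rdmaSeconds / tcpSeconds}%.3fx")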

petro-rudenko commented 5 years ago

I see.

  1. Do you have flow control configured?
  2. Can you please try to run the pagerank workload?
  3. Can you please share spark executor logs?
jiameixie commented 5 years ago

Hi, I have the same problem. The configuration is almost the same. The following is part of the bench log. Can you help me?

19/07/11 14:55:11 INFO cluster.YarnClientSchedulerBackend: Stopped
19/07/11 14:55:11 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(11)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(14)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(3)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(12)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(7)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(6)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(16)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(8)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(1)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(15)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(13)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(5)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(2)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(4)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(0)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(9)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(17)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(10)
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
java.lang.NullPointerException
    at org.apache.spark.shuffle.rdma.RdmaChannel.processRdmaCmEvent(RdmaChannel.java:345)
    at org.apache.spark.shuffle.rdma.RdmaChannel.stop(RdmaChannel.java:894)
    at org.apache.spark.shuffle.rdma.RdmaNode.lambda$new$0(RdmaNode.java:203)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "RdmaNode connection listening thread" java.lang.RuntimeException: Exception in RdmaNode listening thread java.lang.NullPointerException
    at org.apache.spark.shuffle.rdma.RdmaNode.lambda$new$0(RdmaNode.java:210)
    at java.lang.Thread.run(Thread.java:748)
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO rdma.RdmaBufferManager: Rdma buffers allocation statistics:
19/07/11 14:55:11 INFO rdma.RdmaBufferManager: Pre allocated 0, allocated 4611 buffers of size 4 KB
19/07/11 14:55:11 INFO ibm.disni: deallocPd, pd 1
19/07/11 14:55:11 INFO ibm.disni: destroyCmId, id 0
19/07/11 14:55:11 INFO ibm.disni: destroyEventChannel, channel 0
19/07/11 14:55:11 INFO memory.MemoryStore: MemoryStore cleared
19/07/11 14:55:11 INFO storage.BlockManager: BlockManager stopped
19/07/11 14:55:11 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
19/07/11 14:55:11 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/07/11 14:55:11 INFO spark.SparkContext: Successfully stopped SparkContext
19/07/11 14:55:11 INFO util.ShutdownHookManager: Shutdown hook called

jiameixie commented 5 years ago

Some of the YARN log:

19/07/11 16:19:43 INFO rdma.RdmaNode: Starting RdmaNode Listening Server, listening on: /8.8.8.4:50189
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:60996 in 253 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:56128 in 153 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:43667 in 153 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:49283 in 161 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:48335 in 154 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:35004 in 154 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.5:53523 in 156 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:57867 in 172 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:35765 in 181 ms
19/07/11 16:19:45 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 68 to driver took 6 ms
19/07/11 16:19:45 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 16 to driver took 0 ms
19/07/11 16:19:45 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 49 to driver took 0 ms
19/07/11 16:19:45 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 32 to driver took 1 ms
19/07/11 16:19:45 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 87 to driver took 2 ms
19/07/11 16:19:46 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 35 to driver took 0 ms
19/07/11 16:19:46 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 47 to driver took 1 ms
19/07/11 16:19:46 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 26 to driver took 0 ms
19/07/11 16:19:46 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 61 to driver took 0 ms
19/07/11 16:19:46 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 53 to driver took 1 ms
19/07/11 16:19:47 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 98 to driver took 1 ms
19/07/11 16:19:47 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 108 to driver took 1 ms
19/07/11 16:19:47 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 112 to driver took 0 ms
19/07/11 16:19:48 INFO rdma.RdmaShuffleManager: Getting MapTaskOutput table for shuffleId: 0 of size 4096 from driver
19/07/11 16:19:48 INFO rdma.RdmaShuffleManager: RDMA read mapTaskOutput table for shuffleId: 0 took 3 ms
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(7)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(6)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(0)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(8)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(2)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(3)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(9)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(1)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(5)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(12)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(4)
19/07/11 16:19:57 ERROR rdma.RdmaNode: Failed to stop RdmaChannel during 50 ms
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(14)
19/07/11 16:19:57 ERROR rdma.RdmaNode: Failed to stop RdmaChannel during 50 ms
19/07/11 16:19:58 INFO rdma.RdmaNode: Exiting RdmaNode Listening Server
19/07/11 16:19:58 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(16)
19/07/11 16:19:58 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(13)
19/07/11 16:19:58 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(15)
19/07/11 16:19:58 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(11)
19/07/11 16:19:58 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(10)
19/07/11 16:19:58 ERROR rdma.RdmaNode: Failed to stop RdmaChannel during 50 ms
19/07/11 16:19:58 INFO rdma.RdmaBufferManager: Rdma buffers allocation statistics:
19/07/11 16:19:58 INFO rdma.RdmaBufferManager: Pre allocated 0, allocated 342 buffers of size 4 KB
19/07/11 16:19:58 INFO rdma.RdmaBufferManager: Pre allocated 0, allocated 535 buffers of size 512 KB
SLF4J: Found binding in [jar:file:/share/spark_rdma/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class

petro-rudenko commented 5 years ago

So these errors are during the shutdown phase. Does the job itself complete successfully?

jiameixie commented 5 years ago

Does this mean it completed successfully?

19/07/11 16:33:52 INFO DAGScheduler: ResultStage 2 (runJob at SparkHadoopMapReduceWriter.scala:88) finished in 8.207 s
19/07/11 16:33:52 INFO DAGScheduler: Job 1 finished: runJob at SparkHadoopMapReduceWriter.scala:88, took 13.799956 s
19/07/11 16:33:55 INFO SparkHadoopMapReduceWriter: Job job_20190711163338_0006 committed.

petro-rudenko commented 5 years ago

Yes, so the job finished. You can see the job history by setting:

spark.eventLog.enabled true
spark.eventLog.dir PATH_FOR_HISTORY_FILES
jiameixie commented 5 years ago

With the settings below in spark-defaults.conf, there weren't any logs in the directory. I run in YARN client mode. Do you have another way to check this issue?

spark.eventLog.enabled true
spark.eventLog.dir /tmp/spark_logs

petro-rudenko commented 5 years ago

See https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact. You need to start the history server and set spark.eventLog.dir to hdfs:///tmp/spark_logs.
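For example, a minimal spark-defaults.conf sketch for after-the-fact viewing (the HDFS path here is just an illustration; point both properties at a directory that actually exists on your HDFS):

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///tmp/spark_logs
spark.history.fs.logDirectory    hdfs:///tmp/spark_logs

Then start the history server with $SPARK_HOME/sbin/start-history-server.sh and browse finished applications in its web UI (port 18080 by default).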

jiameixie commented 5 years ago

Hi Petro, I still have the problem. Can you help me analyze it? Thanks. I use 2 ARM machine nodes to run terasort with the "huge" profile (about 29 GB of data).

CPU info:

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              56
On-line CPU(s) list: 0-55
Thread(s) per core:  1
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
Vendor ID:           Cavium
Model:               1
Model name:          ThunderX2 99xx
Stepping:            0x1
BogoMIPS:            400.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            32768K
NUMA node0 CPU(s):   0-27
NUMA node1 CPU(s):   28-55
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

Net card:

Infiniband device 'mlx5_1' port 1 status:
    default gid: fe80:0000:0000:0000:9a03:9bff:fe68:4e47
    base lid:    0x0
    sm lid:      0x0
    state:       4: ACTIVE
    phys state:  5: LinkUp
    rate:        40 Gb/sec (4X QDR)
    link_layer:  Ethernet

driver: mlx5_core
version: 4.6-1.0.1
firmware-version: 16.23.1020 (MT_0000000090)
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

OS: Ubuntu 18.04
Java version: 1.8.0_211
Hadoop version: 2.7.4
Spark version: 2.2.3
DiSNI version: v1.7

Top of hibench.conf:

hibench.scale.profile                   huge
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism         200
# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism     120

hadoop.conf:

# Hadoop home
hibench.hadoop.home           /home/linux/xjm/spark_rdma/hadoop-2.7.4
# The path of hadoop executable
hibench.hadoop.executable     ${hibench.hadoop.home}/bin/hadoop
# Hadoop configuration directory
hibench.hadoop.configure.dir  ${hibench.hadoop.home}/etc/hadoop
# The root HDFS path to store HiBench data
hibench.hdfs.master           hdfs://wls-arm-cavium05:9000
# Hadoop release provider. Supported value: apache, cdh5, hdp
hibench.hadoop.release        apache

The test results (duration): no-rdma: 405, rdma: 401

The following are the configuration files and executor logs. Does my configuration have problems?

stderr-excutor1-no-rdma.txt stderr-excutor2-rdma.txt stderr_excutor1-rdma.txt bench.log spark.conf.txt core-site.xml.txt hdfs-site.xml.txt spark-defauts.conf.txt spark-env.sh.txt

petro-rudenko commented 5 years ago

@jiameixie, 2 executors are too few to see a real advantage of RDMA over TCP. Try to increase the number of executors, the number of cores, and the workload size (at least gigantic or bigdata).
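For example, one way to scale the run up is a config sketch like the one below. The property names all come from the configs earlier in this thread; the values are only illustrative and should be sized to the actual machines (cores, memory) and data volume:

# In hibench.conf
hibench.scale.profile                   gigantic
hibench.default.map.parallelism         400
hibench.default.shuffle.parallelism     400

# In spark.conf (YARN mode)
hibench.yarn.executor.num     8
hibench.yarn.executor.cores   7
spark.executor.memory         32g
spark.driver.memory           8g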