Mellanox / SparkRDMA

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx
Apache License 2.0

libdisni resolves hostname to another IP instead of the IP from RdmaNode #26

Closed tobegit3hub closed 5 years ago

tobegit3hub commented 5 years ago

We tried to run SparkRDMA on a YARN cluster. The job was submitted successfully and the RDMA network was initialized. Here is the log.

2019-02-27 17:01:37 INFO  disni:42 - creating  RdmaProvider of type 'nat'
2019-02-27 17:01:37 INFO  disni:40 - jverbs jni version 32
2019-02-27 17:01:37 INFO  disni:46 - sock_addr_in size mismatch, jverbs size 28, native size 16
2019-02-27 17:01:37 INFO  disni:55 - IbvRecvWR size match, jverbs size 32, native size 32
2019-02-27 17:01:37 INFO  disni:58 - IbvSendWR size mismatch, jverbs size 72, native size 128
2019-02-27 17:01:37 INFO  disni:67 - IbvWC size match, jverbs size 48, native size 48
2019-02-27 17:01:37 INFO  disni:73 - IbvSge size match, jverbs size 16, native size 16
2019-02-27 17:01:37 INFO  disni:80 - Remote addr offset match, jverbs size 40, native size 40
2019-02-27 17:01:37 INFO  disni:86 - Rkey offset match, jverbs size 48, native size 48
2019-02-27 17:01:37 INFO  disni:61 - createEventChannel, objId 140472690913856
2019-02-27 17:01:37 INFO  disni:79 - createId, id 140472690927680
2019-02-27 17:01:37 INFO  disni:138 - bindAddr, address /192.168.1.4:0
2019-02-27 17:01:37 INFO  RdmaNode:223 - cpuList from configuration file: 
2019-02-27 17:01:37 INFO  RdmaNode:258 - Empty or failure parsing the cpuList. Defaulting to all available CPUs
2019-02-27 17:01:37 INFO  RdmaNode:274 - Using cpuList: [5, 15, 10, 22, 6, 20, 37, 7, 30, 36, 35, 2, 14, 32, 29, 25, 18, 21, 11, 33, 16, 31, 24, 38, 39, 19, 27, 34, 9, 13, 3, 17, 4, 12, 26, 28, 23, 8, 1, 0]
2019-02-27 17:01:37 INFO  disni:150 - listen, id 0
2019-02-27 17:01:37 INFO  disni:69 - allocPd, objId 140472691160992
2019-02-27 17:01:37 INFO  RdmaNode:116 - Starting RdmaNode Listening Server, listening on: /192.168.1.4:57810

When we run TeraSort, which shuffles through the RDMA shuffle manager, the executor exits unexpectedly. Here is the error log.

2019-02-27 17:01:43 INFO  DAGScheduler:54 - Shuffle files lost for executor: 1 (epoch 1)
2019-02-27 17:01:43 INFO  YarnAllocator:54 - Completed container container_1547200321055_0731_01_000002 on host: m7-model-inf07 (state: COMPLETE, exit status: 134)
2019-02-27 17:01:43 WARN  YarnAllocator:66 - Container marked as failed: container_1547200321055_0731_01_000002 on host: m7-model-inf07. Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1547200321055_0731_01_000002
Exit code: 134
Exception message: /bin/bash: line 1: 17944 Aborted                 LD_LIBRARY_PATH=./libdisni.so::/mnt/disk0/home/work/hadoop/lib/native:/mnt/disk0/home/work/hadoop/lib/native /home/work/jdk/bin/java -server -Xmx8192m -Djava.io.tmpdir=/mnt/disk1/nm-local/usercache/work/appcache/application_1547200321055_0731/container_1547200321055_0731_01_000002/tmp '-Dspark.ui.port=0' '-Dspark.driver.port=38467' -Dspark.yarn.app.container.log.dir=/mnt/disk0/home/work/hadoop/logs/userlogs/application_1547200321055_0731/container_1547200321055_0731_01_000002 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@m7-model-inf03:38467 --executor-id 1 --hostname m7-model-inf07 --cores 4 --app-id application_1547200321055_0731 --user-class-path file:/mnt/disk1/nm-local/usercache/work/appcache/application_1547200321055_0731/container_1547200321055_0731_01_000002/__app__.jar > /mnt/disk0/home/work/hadoop/logs/userlogs/application_1547200321055_0731/container_1547200321055_0731_01_000002/stdout 2> /mnt/disk0/home/work/hadoop/logs/userlogs/application_1547200321055_0731/container_1547200321055_0731_01_000002/stderr

Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 17944 Aborted                 LD_LIBRARY_PATH=./libdisni.so::/mnt/disk0/home/work/hadoop/lib/native:/mnt/disk0/home/work/hadoop/lib/native /home/work/jdk/bin/java -server -Xmx8192m -Djava.io.tmpdir=/mnt/disk1/nm-local/usercache/work/appcache/application_1547200321055_0731/container_1547200321055_0731_01_000002/tmp '-Dspark.ui.port=0' '-Dspark.driver.port=38467' -Dspark.yarn.app.container.log.dir=/mnt/disk0/home/work/hadoop/logs/userlogs/application_1547200321055_0731/container_1547200321055_0731_01_000002 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@m7-model-inf03:38467 --executor-id 1 --hostname m7-model-inf07 --cores 4 --app-id application_1547200321055_0731 --user-class-path file:/mnt/disk1/nm-local/usercache/work/appcache/application_1547200321055_0731/container_1547200321055_0731_01_000002/__app__.jar > /mnt/disk0/home/work/hadoop/logs/userlogs/application_1547200321055_0731/container_1547200321055_0731_01_000002/stdout 2> /mnt/disk0/home/work/hadoop/logs/userlogs/application_1547200321055_0731/container_1547200321055_0731_01_000002/stderr

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
    at org.apache.hadoop.util.Shell.run(Shell.java:479)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 134
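
For reference, exit status 134 follows the usual shell convention of 128 plus the signal number, so it points to signal 6 (SIGABRT). That is most likely the JVM aborting after a fatal native error, though the log above does not confirm it by itself. A minimal sketch of the decoding, where the helper name `signal_from_exit_status` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: map a shell exit status above 128 to its signal number.
signal_from_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        echo 0   # process was not terminated by a signal
    fi
}

signal_from_exit_status 134   # prints 6
```

Running `kill -l 6` afterwards names the signal (ABRT on Linux).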

The YARN container log shows that disni calls bindAddr with the actual RDMA IP (192.168.1.9) but calls resolveAddr with the wrong, non-RoCE IP (172.27.128.154).

2019-02-27 15:47:37 INFO  disni:80 - Remote addr offset match, jverbs size 40, native size 40
2019-02-27 15:47:37 INFO  disni:86 - Rkey offset match, jverbs size 48, native size 48
2019-02-27 15:47:37 INFO  disni:61 - createEventChannel, objId 140534197564176
2019-02-27 15:47:37 INFO  disni:79 - createId, id 140534197563312
2019-02-27 15:47:37 INFO  disni:138 - bindAddr, address /192.168.1.9:0
2019-02-27 15:47:37 INFO  RdmaNode:223 - cpuList from configuration file:
2019-02-27 15:47:37 INFO  RdmaNode:258 - Empty or failure parsing the cpuList. Defaulting to all available CPUs
2019-02-27 15:47:37 INFO  RdmaNode:274 - Using cpuList: [18, 8, 22, 17, 31, 3, 5, 34, 24, 37, 20, 25, 13, 16, 28, 21, 36, 26, 9, 33, 10, 6, 14, 1, 27, 39, 4, 38, 15, 19, 7, 35, 32, 12, 30, 29, 2, 23, 11, 0]
2019-02-27 15:47:37 INFO  disni:150 - listen, id 0
2019-02-27 15:47:37 INFO  disni:69 - allocPd, objId 140534189036912
2019-02-27 15:47:37 INFO  RdmaNode:116 - Starting RdmaNode Listening Server, listening on: /192.168.1.9:38440
2019-02-27 15:47:37 INFO  disni:61 - createEventChannel, objId 19752576
2019-02-27 15:47:37 INFO  disni:79 - createId, id 19753632
2019-02-27 15:47:37 INFO  disni:167 - resolveAddr, addres m7-model-inf08/172.27.128.154:37652
2019-02-27 15:47:37 INFO  RdmaChannel:874 - Stopping RdmaChannel RdmaChannel(0)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd0649e197b, pid=33656, tid=0x00007fd06a9cf700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_121-b13) (build 1.8.0_121-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.121-b13 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [librdmacm.so.1+0x597b]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/disk3/nm-local/usercache/work/appcache/application_1547200321055_0724/container_1547200321055_0724_02_000006/hs_err_pid33656.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
End of LogType:stdout
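
The bind/resolve mismatch described above can be pulled out of a container log mechanically. The sketch below is only an illustration: the function name `disni_addrs` is hypothetical, and the patterns assume the exact `bindAddr` and `resolveAddr` line formats seen in this log (including disni's `addres` spelling).

```shell
#!/bin/sh
# Hypothetical helper: read a container log on stdin and print the IPs
# that disni bound to and tried to resolve, one per line.
disni_addrs() {
    sed -n \
        -e 's|.*bindAddr, address /\([0-9.]*\):.*|bind \1|p' \
        -e 's|.*resolveAddr, addres[s]* [^/]*/\([0-9.]*\):.*|resolve \1|p'
}

# Demo on the two relevant lines from the log above.
disni_addrs <<'EOF'
2019-02-27 15:47:37 INFO  disni:138 - bindAddr, address /192.168.1.9:0
2019-02-27 15:47:37 INFO  disni:167 - resolveAddr, addres m7-model-inf08/172.27.128.154:37652
EOF
```

On a real cluster the input could come from `yarn logs -applicationId <appId> | disni_addrs`. A `resolve` address outside the RDMA subnet of the `bind` address suggests the peer hostname is resolving to the non-RoCE interface.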

petro-rudenko commented 5 years ago

OK, you need to add this to yarn-env.sh:

export YARN_NODEMANAGER_OPTS="-Dyarn.nodemanager.hostname=$RDMA_IP"
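
`$RDMA_IP` is not defined in the line above, so each NodeManager host would need to set it to its own RDMA address. One possible way to derive it, as a sketch only: the interface name `eth2` and the helper `first_ipv4` are assumptions for illustration and should be replaced with the actual RoCE device on each node.

```shell
#!/bin/sh
# Assumption: name of the RoCE-capable interface; override via the environment.
RDMA_IFACE="${RDMA_IFACE:-eth2}"

# Hypothetical helper: take `ip -4 -o addr show` output on stdin and
# print the first IPv4 address without its prefix length.
first_ipv4() {
    awk '{ split($4, a, "/"); print a[1]; exit }'
}

RDMA_IP=""
if command -v ip >/dev/null 2>&1; then
    RDMA_IP=$(ip -4 -o addr show dev "$RDMA_IFACE" 2>/dev/null | first_ipv4)
fi

# Only export when an address was actually found, to avoid an empty hostname.
if [ -n "$RDMA_IP" ]; then
    export YARN_NODEMANAGER_OPTS="-Dyarn.nodemanager.hostname=$RDMA_IP"
fi
```

Hard-coding the address per host works just as well; the lookup only saves maintaining a different yarn-env.sh on every node.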

tobegit3hub commented 5 years ago

Thanks @petro-rudenko. That looks good; we will try it later.

tobegit3hub commented 5 years ago

Same issue as https://github.com/Mellanox/SparkRDMA/issues/25