Closed: tobegit3hub closed this issue 5 years ago
OK, the following needs to be added to yarn-env.sh:
export YARN_NODEMANAGER_OPTS="-Dyarn.nodemanager.hostname=$RDMA_IP"
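The export above assumes `RDMA_IP` is already set on each node. One possible way to fill it in from the RDMA interface is sketched below. The interface name `ib0`, the `RDMA_IFACE` variable, and the `extract_ipv4` helper are assumptions made for illustration, not part of SparkRDMA or YARN, so adjust them to the actual node setup.

```shell
# Hypothetical helper: pull the IPv4 address out of the first line of
# `ip -o -4 addr show` output (the 4th field is "ADDRESS/PREFIX").
extract_ipv4() {
  awk '{ split($4, a, "/"); print a[1]; exit }'
}

# Assumption: the RoCE/RDMA interface is named ib0; override RDMA_IFACE if not.
RDMA_IFACE="${RDMA_IFACE:-ib0}"

# Only query the interface if RDMA_IP was not already set by hand.
if [ -z "${RDMA_IP:-}" ] && command -v ip >/dev/null 2>&1; then
  RDMA_IP="$(ip -o -4 addr show dev "$RDMA_IFACE" 2>/dev/null | extract_ipv4)"
fi

# Same option as in the comment above, applied only when an address was found.
if [ -n "${RDMA_IP:-}" ]; then
  export YARN_NODEMANAGER_OPTS="-Dyarn.nodemanager.hostname=$RDMA_IP"
fi
```

If the interface does not exist or has no IPv4 address, `RDMA_IP` stays empty and `YARN_NODEMANAGER_OPTS` is left untouched, so the NodeManager keeps its default hostname.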
Thanks @petro-rudenko. That looks good; we will try it later.
This is the same issue as https://github.com/Mellanox/SparkRDMA/issues/25.
We have tried to run SparkRDMA in a YARN cluster. The job was submitted successfully and initialized the RDMA network. Here is the log.
If we run TeraSort, which shuffles data through the RDMA shuffle manager, the executor exits unexpectedly. Here is the error log.
If we read the YARN container log, it shows that DiSNI calls bindAddr with the actual RDMA IP (192.168.1.9) but calls resolveAddr with the wrong, non-RoCE IP (172.27.128.154).