YouhuiBai opened this issue 3 years ago
I don't think AWS p3dn.24xlarge supports native RDMA. Can it run the ib_write_bw benchmark (you can install it with apt-get install perftest)? AFAIK, AWS uses EFA. You can try the EFA implementation with DMLC_ENABLE_RDMA=fabric.
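For reference, a rough sketch of that check, assuming Ubuntu with apt and two nodes that can reach each other (on p3dn/EFA the verbs test is expected to fail, which is exactly what it would demonstrate):
# install perftest, which provides ib_write_bw
sudo apt-get install -y perftest
# on node A, start the server side
ib_write_bw
# on node B, point the client at node A's IP
ib_write_bw NODE_A_IP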
@bobzhuyb Thanks a lot. The AWS instances don't support native RDMA, so I will try the EFA implementation.
@bobzhuyb Is the EFA implementation merged into the master branch of BytePS? Are there any tutorials or benchmarks? Thank you.
Read this: https://github.com/bytedance/ps-lite/blob/byteps/efa.md
You probably need to add USE_FABRIC=1 to this line: https://github.com/bytedance/byteps/blob/master/setup.py#L920
Then set DMLC_ENABLE_RDMA=fabric when running your program.
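To make that concrete, a minimal sketch of the standalone ps-lite build and the runtime switch; USE_FABRIC=1 as a make option and DMLC_ENABLE_RDMA=fabric are taken from this thread and efa.md, so verify them against that document and your checkout:
# build the byteps branch of ps-lite with the EFA (libfabric) backend
git clone -b byteps https://github.com/bytedance/ps-lite.git
cd ps-lite
make -j USE_FABRIC=1
# when launching BytePS, select the fabric van
export DMLC_ENABLE_RDMA=fabric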
@bobzhuyb Thank you very much, I will try it.
@bobzhuyb Hi, I built ps-lite with the USE_FABRIC=1 flag and ran test_benchmark successfully. But when I further installed BytePS from source and tried to launch a scheduler, I hit an OSError: lib/python3.6/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: fi_freeinfo. My environment is as follows:
OS: Ubuntu 16.04; GCC: 5.4.0 and 4.9.3 (I tried both)
Do you have any suggestions?
Did you install libfabric-dev?
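If it helps, a quick way to check the libfabric/EFA setup, assuming an AWS image where the EFA installer puts things under /opt/amazon/efa (package names may differ on your distribution):
# Ubuntu package; AWS also ships its own libfabric-aws-dev
sudo apt-get install -y libfabric-dev
# fi_info ships with libfabric (on AWS images it is often /opt/amazon/efa/bin/fi_info);
# it should list the efa provider if EFA is set up
fi_info -p efa
# typical EFA header and library locations on AWS images
ls /opt/amazon/efa/include /opt/amazon/efa/lib64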
@ymjiang I installed libfabric-aws-dev, as shown below:
Add the libfabric library to the linker. E.g., here https://github.com/bytedance/byteps/blob/master/setup.py#L331 you may try adding fabric
@bobzhuyb It worked for me after adding fabric in L331 of setup.py and adding the EFA include and library paths to INCLUDES and LIBRARY_DIRS. Thanks a lot!
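For anyone hitting the same undefined symbol, a hedged way to verify the rebuilt extension; the .so path below mirrors the one from the error message above, so adjust it to your own site-packages, and note that the EFA library directory may need to be on LD_LIBRARY_PATH at runtime:
SO=/path/to/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so
# after adding fabric to the linker flags, libfabric should appear as a dependency
ldd "$SO" | grep -i fabric
# if libfabric is not on the default search path, point the loader at the EFA install
export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH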
@bobzhuyb @ymjiang Hi, I hit a new problem when running BytePS across multiple AWS instances; the error message is shown in the following figure and is only printed on the workers. I tried enabling the RDMAV_FORK_SAFE and IBV_FORK_SAFE environment variables (see the exports after the commands below), but they made no difference. Do you have any suggestions? My commands are:
# scheduler
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=ROOT_IP
export DMLC_PS_ROOT_PORT=1234
export DMLC_INTERFACE=ens5
export DMLC_ENABLE_RDMA=fabric
export BYTEPS_ENABLE_IPC=1
bpslaunch
# servers
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=ROOT_IP
export DMLC_PS_ROOT_PORT=1234
export DMLC_INTERFACE=ens5
export DMLC_ENABLE_RDMA=fabric
export BYTEPS_ENABLE_IPC=1
bpslaunch
# worker 0
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=ROOT_IP
export DMLC_PS_ROOT_PORT=1234
export DMLC_INTERFACE=ens5
export DMLC_ENABLE_RDMA=fabric
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_WORKER_ID=0
bpslaunch python3 byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 30
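For completeness, the fork-safety toggles mentioned above are standard rdma-core environment variables; I exported them before bpslaunch, but they made no difference:
export RDMAV_FORK_SAFE=1
export IBV_FORK_SAFE=1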
You may try removing this line: https://github.com/bytedance/ps-lite/blob/28330e65672a72e07bb7317821b542dca6574356/src/rdma_van.h#L29. Then do make clean under the ps-lite dir. Finally, recompile BytePS.
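A rough sketch of that rebuild sequence, assuming ps-lite is vendored under the BytePS source tree (commonly 3rdparty/ps-lite; adjust the paths to your checkout):
# clear the previously built ps-lite objects
cd byteps/3rdparty/ps-lite
make clean
# then rebuild and reinstall BytePS from the repo root
cd ../..
python3 setup.py clean --all
python3 setup.py install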
My fault. I thought fabric van was inherited from rdma van.
I am confused why it would ever reach the rdma van; it should use the fabric van.
@bobzhuyb I have no idea either. I even disabled the ps-lite make option USE_RDMA=1 and only enabled USE_FABRIC=1; the log messages show that it is creating the fabric van rather than the rdma van.
@bobzhuyb @ymjiang Hi, did you enable hierarchical allreduce for Horovod in your OSDI '20 paper's evaluations?
@YouhuiBai See Section 9.
@ymjiang I mean Horovod's hierarchical allreduce rather than NCCL's. Horovod implements hierarchical allreduce with ReduceScatter, Bcast, etc., and it can be enabled by an environment variable or a horovodrun parameter.
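As a concrete, hedged example of what I mean: HOROVOD_HIERARCHICAL_ALLREDUCE is the Horovod environment variable I am referring to, and train.py and the host names below are placeholders; please double-check the exact knobs against your Horovod version:
# enable Horovod's hierarchical allreduce via the environment
export HOROVOD_HIERARCHICAL_ALLREDUCE=1
horovodrun -np 16 -H host1:8,host2:8 python3 train.py
# recent horovodrun versions also expose a matching command-line flag (see horovodrun --help)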
Describe the bug
I trained the BytePS benchmarks shown in step-by-step-tutorial.md on an AWS EC2 p3dn.24xlarge instance, which has a 100 Gbps network and 8 V100 GPUs connected by NVLink, as mentioned in the BytePS paper and README. But I failed to run BytePS on this instance with RDMA enabled. My question is how to adapt BytePS to this instance with RDMA enabled.

To Reproduce
Steps to reproduce the behavior: set DMLC_ENABLE_RDMA=1.

Expected behavior
I want to know how to train DNN models atop BytePS on an AWS EC2 p3dn.24xlarge instance with RDMA enabled. Thanks.