bytedance / byteps

A high performance and generic framework for distributed DNN training

Failed to train benchmark on AWS EC2 p3dn.24xlarge instance with RDMA #391

Open YouhuiBai opened 3 years ago

YouhuiBai commented 3 years ago

Describe the bug I trained the BytePS benchmarks from step-by-step-tutorial.md on an AWS EC2 p3dn.24xlarge instance, which has 100 Gbps networking and 8 V100 GPUs connected by NVLink, and which is mentioned in the BytePS paper and README. However, I failed to run BytePS on this instance with RDMA enabled. My question is: how do I adapt BytePS to this instance with RDMA enabled?

To Reproduce Steps to reproduce the behavior:

  1. Launch AWS EC2 p3dn.24xlarge instances
  2. Docker pull BytePS images
  3. Run BytePS with DMLC_ENABLE_RDMA=1 (see the sketch after this list)
  4. See error like the figure in Screenshots
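
For concreteness, a rough sketch of steps 2-4; the image name, docker flags, and the benchmark path are assumptions based on step-by-step-tutorial.md and may differ:

# pull and start a BytePS container (image name assumed from the tutorial)
docker pull bytepsimage/pytorch
nvidia-docker run -it --net=host --shm-size=32768m bytepsimage/pytorch bash
# inside the container, with the usual DMLC_* variables already exported:
export DMLC_ENABLE_RDMA=1
bpslaunch python3 byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 30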

Expected behavior I want to know how to train DNN models atop BytePS on an AWS EC2 p3dn.24xlarge instance with RDMA enabled. Thanks.

Screenshots: (error screenshot attached)

Environment (please complete the following information):

bobzhuyb commented 3 years ago

I don't think AWS p3dn.24xlarge supports native RDMA. Can it run the ib_write_bw benchmark (you can install it with apt-get install perftest)? AFAIK, AWS uses EFA. You can try the EFA implementation: DMLC_ENABLE_RDMA=fabric.
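
For reference, a minimal way to check for native RDMA (verbs) support with perftest; the server IP below is a placeholder:

# install perftest and run a point-to-point verbs bandwidth test
sudo apt-get install -y perftest
# on one node (acts as the server):
ib_write_bw
# on the other node, pointing at the first node's IP:
ib_write_bw <server_ip>
# if no verbs device is found, the instance has no native RDMA support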

YouhuiBai commented 3 years ago

@bobzhuyb Thanks a lot. The AWS instances indeed don't support native RDMA; I will try the EFA implementation.

YouhuiBai commented 3 years ago

@bobzhuyb Is the EFA implementation merged into the master branch of BytePS? Are there any tutorials or benchmarks? Thank you.

bobzhuyb commented 3 years ago

Read this https://github.com/bytedance/ps-lite/blob/byteps/efa.md

You probably need to add USE_FABRIC=1 to this line https://github.com/bytedance/byteps/blob/master/setup.py#L920

Then set DMLC_ENABLE_RDMA=fabric when running your program
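
Putting the pointers above together, a rough sketch of the build-and-run sequence (assuming USE_FABRIC=1 is forwarded to the ps-lite make, per efa.md):

# rebuild ps-lite with the EFA/libfabric backend
cd ps-lite
make clean
make USE_FABRIC=1 -j
# then rebuild/install BytePS, and at runtime:
export DMLC_ENABLE_RDMA=fabric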

YouhuiBai commented 3 years ago

@bobzhuyb Thank you very much, I will try it.

YouhuiBai commented 3 years ago

@bobzhuyb Hi, I built ps-lite with the USE_FABRIC=1 flag and ran test_benchmark successfully. But when I further installed BytePS from source and tried to launch a scheduler role, I hit an OSError: lib/python3.6/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: fi_freeinfo. My environment is as follows:

OS: Ubuntu 16.04
GCC version: 5.4.0 and 4.9.3 (I tried both)

Do you have any suggestions?

ymjiang commented 3 years ago

Did you install libfabric-dev?

YouhuiBai commented 3 years ago

@ymjiang I installed libfabric-aws-dev, as shown in the attached screenshot.
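
If it helps, a quick way to sanity-check the libfabric/EFA installation (fi_info ships with libfabric; on the AWS EFA installer it may live under /opt/amazon/efa/bin):

# confirm the libfabric packages are present
dpkg -l | grep libfabric
# confirm libfabric can see the EFA provider
fi_info -p efa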

bobzhuyb commented 3 years ago

Add the libfabric library to the linker. E.g., here https://github.com/bytedance/byteps/blob/master/setup.py#L331 you may try adding fabric to the library list.

YouhuiBai commented 3 years ago

@bobzhuyb Adding fabric at L331 of setup.py works for me, along with adding the EFA include and library paths to INCLUDES and LIBRARY_DIRS. Thanks a lot!
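
For anyone hitting the same undefined symbol, the AWS EFA installer's default locations are likely the paths to add (verify on your own instance):

# typical locations created by the AWS EFA installer
ls /opt/amazon/efa/include   # libfabric headers (add to INCLUDES)
ls /opt/amazon/efa/lib       # libfabric.so (add to LIBRARY_DIRS, link with -lfabric)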

YouhuiBai commented 3 years ago

@bobzhuyb @ymjiang Hi, I hit a new problem when running BytePS across multiple AWS instances. The error message is shown in the figure below and is only printed on the workers. I tried enabling the RDMAV_FORK_SAFE and IBV_FORK_SAFE environment variables, but that made no difference. Do you have any suggestions? My commands are:

# scheduler
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=ROOT_IP
export DMLC_PS_ROOT_PORT=1234
export DMLC_INTERFACE=ens5
export DMLC_ENABLE_RDMA=fabric
export BYTEPS_ENABLE_IPC=1
bpslaunch

# servers
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=ROOT_IP
export DMLC_PS_ROOT_PORT=1234
export DMLC_INTERFACE=ens5
export DMLC_ENABLE_RDMA=fabric
export BYTEPS_ENABLE_IPC=1
bpslaunch

# worker 0
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=ROOT_IP
export DMLC_PS_ROOT_PORT=1234
export DMLC_INTERFACE=ens5
export DMLC_ENABLE_RDMA=fabric
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_WORKER_ID=0
bpslaunch python3 byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 30

(screenshot of the worker error)

ymjiang commented 3 years ago

You may try removing this line: https://github.com/bytedance/ps-lite/blob/28330e65672a72e07bb7317821b542dca6574356/src/rdma_van.h#L29. Then do make clean under the ps-lite dir. Finally, recompile BytePS.
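
A rough sketch of that rebuild sequence (the directory layout is an assumption; adjust to where your ps-lite checkout lives):

# after editing src/rdma_van.h as suggested above
cd ps-lite
make clean
make USE_FABRIC=1 -j
# then reinstall BytePS from source
cd ../byteps
python3 setup.py install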

My fault. I thought fabric van was inherited from rdma van.

bobzhuyb commented 3 years ago

I am confused about why it would ever reach the rdma van. It should use the fabric van.

YouhuiBai commented 3 years ago

@bobzhuyb I have no idea either. I even disabled the ps-lite make option USE_RDMA=1 and only enabled USE_FABRIC=1, and the log messages show that it is creating the fabric van rather than the rdma van.
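
In case it helps with debugging, the van selection shows up in the ps-lite startup logs when verbose logging is on (PS_VERBOSE is a ps-lite setting; the exact log wording may vary):

# enable ps-lite debug logging; the startup log reports which Van is created
export PS_VERBOSE=1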

YouhuiBai commented 3 years ago

@bobzhuyb @ymjiang Hi, did you enable hierarchical-allreduce for Horovod in your OSDI20 paper's evaluations?

ymjiang commented 3 years ago

@YouhuiBai See Section 9. (screenshot of the relevant paper text)

YouhuiBai commented 3 years ago

@ymjiang I mean the hierarchical allreduce in Horovod itself rather than in NCCL. Horovod implements hierarchical allreduce with ReduceScatter, Bcast, etc., and it can be enabled by an environment variable or a horovodrun parameter, as sketched below.
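
For example, this is the switch I am referring to (a hedged sketch; the host names and process counts are placeholders):

# enable Horovod's own hierarchical allreduce (the ReduceScatter/Bcast-based path)
export HOROVOD_HIERARCHICAL_ALLREDUCE=1
horovodrun -np 16 -H host1:8,host2:8 python3 benchmark.py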