bytedance / byteps

A high performance and generic framework for distributed DNN training

RDMA_CM_EVENT_ADDR_ERROR #377

Open Ruinhuang opened 3 years ago

Ruinhuang commented 3 years ago

Describe the bug: When I run BytePS with RDMA on 2 nodes, node 2 can't bind to node 1's scheduler.

To Reproduce

Steps to reproduce the behavior:

1. Build the PyTorch Docker image: `docker build -t byteps:pytorch_native . -f Dockerfile_byteps --build-arg FRAMEWORK=pytorch`
2. Run BytePS on 2 nodes (2 workers + 2 servers), following https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#distributed-training-with-rdma

node 1 scheduler:

nvidia-docker run -it --net=host --ulimit memlock=-1 --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:pytorch_native
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=2

export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=193.168.1.135
export DMLC_PS_ROOT_PORT=3333
export BYTEPS_RDMA_RX_DEPTH=1024
export BYTEPS_RDMA_START_DEPTH=64 
PS_VERBOSE=2 bpslaunch

node 2 server:

nvidia-docker run -it --net=host --ulimit memlock=-1 --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:pytorch_native

export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=2
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=193.168.1.135
export DMLC_PS_ROOT_PORT=3333
export BYTEPS_RDMA_RX_DEPTH=1024
export BYTEPS_RDMA_START_DEPTH=64
3. See error. node 2 server error:
    
    BytePS launching worker
    [03:25:35] src/postoffice.cc:25: Creating Van: 1
    [03:25:35] src/van.cc:84: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
    [03:25:35] src/./rdma_van.h:44: Shared memory IPC has been disabled
    [03:25:35] src/van.cc:441: Bind to [role=worker, ip=193.168.1.134, port=36411, is_recovery=0, aux_id=-1]
    [03:25:35] src/./rdma_van.h:155: Connecting to Node 1, My_Node=2147483647
    [03:25:35] 3rdparty/ps-lite/include/dmlc/logging.h:276: [03:25:35] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)

    Stack trace returned 6 entries:
    [bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/torch/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x43155) [0x7f42f9d0f155]
    [bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/torch/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x4425d) [0x7f42f9d1025d]
    [bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/torch/c_lib.cpython-36m-x86_64-linux-gnu.so(+0xd91e5) [0x7f42f9da51e5]
    [bt] (3) /usr/local/lib/python3.6/dist-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0xedef) [0x7f4305c98def]
    [bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f43095636db]
    [bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f430989c71f]
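For readers decoding the failing log line: the number in `unknown event 1` is a value of librdmacm's `rdma_cm_event_type` enum, and `RDMA_CM_EVENT_ADDR_ERROR` is the event reported when address resolution (`rdma_resolve_addr`) fails. The sketch below maps the first few codes to their names. The helper name `rdma_cm_event_name` is hypothetical and not part of BytePS or ps-lite, and the ordering is assumed to follow the enum in `rdma/rdma_cma.h`.

```shell
# Hypothetical helper (not part of BytePS or ps-lite): translate a numeric
# rdma_cm event code, as printed in the ps-lite log, into its enum name.
# Assumption: the order follows enum rdma_cm_event_type in <rdma/rdma_cma.h>.
rdma_cm_event_name() {
    case "$1" in
        0) echo "RDMA_CM_EVENT_ADDR_RESOLVED" ;;
        1) echo "RDMA_CM_EVENT_ADDR_ERROR" ;;
        2) echo "RDMA_CM_EVENT_ROUTE_RESOLVED" ;;
        3) echo "RDMA_CM_EVENT_ROUTE_ERROR" ;;
        4) echo "RDMA_CM_EVENT_CONNECT_REQUEST" ;;
        5) echo "RDMA_CM_EVENT_CONNECT_RESPONSE" ;;
        6) echo "RDMA_CM_EVENT_CONNECT_ERROR" ;;
        7) echo "RDMA_CM_EVENT_UNREACHABLE" ;;
        8) echo "RDMA_CM_EVENT_REJECTED" ;;
        9) echo "RDMA_CM_EVENT_ESTABLISHED" ;;
        10) echo "RDMA_CM_EVENT_DISCONNECTED" ;;
        # Codes above 10 are deliberately left out of this sketch.
        *) echo "unknown rdma_cm event: $1" >&2; return 1 ;;
    esac
}

# The code from the log above.
rdma_cm_event_name 1
```

Read this way, the failure happens while node 2 resolves the scheduler's address, before any connection or queue pair is set up.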


**Additional context**
1. node 1's worker and server can bind to node 1's scheduler; node 2's worker, like its server, can't bind to node 1's scheduler
2. my ifconfig info:

    ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
            inet 193.168.1.135  netmask 255.255.255.0  broadcast 193.168.1.255
            inet6 fe80::ba59:9f03:1b:a952  prefixlen 64  scopeid 0x20
            unspec 20-00-09-07-FE-80-00-00-00-00-00-00-00-00-00-00  txqueuelen 256  (UNSPEC)
            RX packets 4637  bytes 1337112 (1.3 MB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 130  bytes 7896 (7.8 KB)
            TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

3. ib_send_bw output:

                    Send BW Test

    Dual-port       : OFF
    Device          : mlx5_0
    Number of qps   : 2
    Transport type  : IB
    Connection type : RC
    Using SRQ       : OFF
    TX depth        : 128
    CQ Moderation   : 100
    Mtu             : 4096[B]
    Link type       : IB
    GID index       : 3
    Max inline data : 0[B]
    rdma_cm QPs     : OFF
    Data ex. method : Ethernet

    local address:  LID 0x02 QPN 0x1357 PSN 0xa2869d GID: 254:128:00:00:00:00:00:00:00:00:00:00:00:00:00:00
    local address:  LID 0x02 QPN 0x1358 PSN 0xa7c8c3 GID: 254:128:00:00:00:00:00:00:00:00:00:00:00:00:00:00
    remote address: LID 0x0f QPN 0x0cac PSN 0xc13105 GID: 254:128:00:00:00:00:00:00:00:00:00:00:00:00:00:00
    remote address: LID 0x0f QPN 0x0cad PSN 0xafc70b GID: 254:128:00:00:00:00:00:00:00:00:00:00:00:00:00:00

    bytes     #iterations   BW peak[MB/sec]   BW average[MB/sec]   MsgRate[Mpps]
    2         1000          6.13              5.84                 3.064301
    4         1000          18.42             18.38                4.817242
    8         1000          38.79             38.32                5.022091
    16        1000          74.01             73.73                4.832060
    32        1000          154.81            153.05               5.015286
    64        1000          284.84            283.69               4.648054
    128       1000          594.79            585.45               4.795980
    256       1000          1212.09           1134.74              4.647879
    512       1000          2401.47           2320.86              4.753125
    1024      1000          4836.90           4773.23              4.887785
    2048      1000          9217.28           8673.96              4.441065
    4096      1000          11237.80          11229.43             2.874734
    8192      1000          11299.42          11298.69             1.446232
    16384     1000          11358.14          11355.68             0.726763
    32768     1000          7668.49           7667.84              0.245371
    65536     1000          11389.69          11389.15             0.182226
    131072    1000          7688.10           7687.93              0.061503
    262144    1000          11397.56          11397.52             0.045590
    524288    1000          11396.04          11395.97             0.022792
    1048576   1000          10937.69          10023.54             0.010024
    2097152   1000          10940.64          9833.39              0.004917
    4194304   1000          11394.83          10556.46             0.002639
    8388608   1000          11394.62          10179.88             0.001272

4. ```ulimit -l``` is unlimited
5. ```ibdev2netdev``` is:

    mlx5_0 port 1 ==> ib0 (Up)
    mlx5_1 port 1 ==> ib1 (Up)
    mlx5_2 port 1 ==> ib2 (Up)
    mlx5_3 port 1 ==> ib3 (Up)
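One thing the details above allow checking offline is whether the address node 2 bound to (193.168.1.134 in the log) and the scheduler address (193.168.1.135) fall in the same subnet under the 255.255.255.0 netmask from the ifconfig output. This is a minimal sketch; `ip_to_int` and `same_subnet` are hypothetical helper names, and the check is pure arithmetic, so it says nothing about whether rdma_cm can actually resolve the address.

```shell
# Hypothetical helpers: pure arithmetic, no network access.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<EOF
$1
EOF
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed (exit status 0) if two addresses share a network under the mask.
# Usage: same_subnet <ip1> <ip2> <netmask>
same_subnet() {
    local ip1 ip2 mask
    ip1=$(ip_to_int "$1")
    ip2=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( ip1 & mask )) -eq $(( ip2 & mask )) ]
}

# Addresses taken from the log and the ifconfig output above.
if same_subnet 193.168.1.134 193.168.1.135 255.255.255.0; then
    echo "same subnet"
else
    echo "different subnets"
fi
```

Note also that the `ib_send_bw` run above reports `rdma_cm QPs : OFF` and `Data ex. method : Ethernet`, so it likely did not go through rdma_cm address resolution at all. Rerunning it with perftest's `-R` (`--rdma_cm`) option may exercise the same path that fails in BytePS.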

Ruinhuang commented 3 years ago

Hi @ymjiang, do you have any idea? Thanks a lot.

Ruinhuang commented 3 years ago

I have tried reinstalling BytePS, but it still doesn't work:

pip3 uninstall -y byteps
python3 setup.py clean
rm -rf byteps
git clone --recursive https://github.com/bytedance/byteps
cd byteps
python3 setup.py install

I have also tried rebuilding ps-lite, but that doesn't work either:

cd  byteps/3rdparty/ps-lite/
make clean
make -j USE_RDMA=1