Open Ruinhuang opened 3 years ago
Hi @ymjiang, do you have any idea? Thanks a lot.
I have tried reinstalling BytePS, but it still doesn't work:
pip3 uninstall -y byteps
python3 setup.py clean
rm -rf byteps
git clone --recursive https://github.com/bytedance/byteps
cd byteps
python3 setup.py install
I have also tried rebuilding ps-lite, but that doesn't work either:
cd byteps/3rdparty/ps-lite/
make clean
make -j USE_RDMA=1
Describe the bug
When I run BytePS with RDMA on 2 nodes, node 2 can't bind to node 1's scheduler.
To Reproduce
Steps to reproduce the behavior:
1. Build the PyTorch Docker image:
docker build -t byteps:pytorch_native . -f Dockerfile_byteps --build-arg FRAMEWORK=pytorch
2. Run BytePS on 2 nodes (2 workers + 2 servers), following https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#distributed-training-with-rdma:
node 1 scheduler:
node 2 server:
Stack trace returned 6 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/torch/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x43155) [0x7f42f9d0f155]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/torch/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x4425d) [0x7f42f9d1025d]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/torch/c_lib.cpython-36m-x86_64-linux-gnu.so(+0xd91e5) [0x7f42f9da51e5]
[bt] (3) /usr/local/lib/python3.6/dist-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0xedef) [0x7f4305c98def]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f43095636db]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f430989c71f]
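Since node 2 fails to reach node 1's scheduler, one thing worth ruling out is a mismatch in the launch environment between the two nodes. The sketch below compares the `DMLC_*` variables that every role is expected to share. The variable names are the ones used in the BytePS RDMA tutorial; the helper `find_mismatches`, the process names, and all values are hypothetical, because the actual launch commands are not shown in this report.

```python
# Variables that scheduler, servers and workers are assumed to agree on.
SHARED_KEYS = (
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
    "DMLC_NUM_WORKER",
    "DMLC_NUM_SERVER",
    "DMLC_ENABLE_RDMA",
)


def find_mismatches(envs):
    """Return {key: {process: value}} for each shared key whose value differs."""
    mismatches = {}
    for key in SHARED_KEYS:
        values = {name: env.get(key) for name, env in envs.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


# Hypothetical values for illustration only; they are not taken from the report.
base = {
    "DMLC_PS_ROOT_URI": "193.168.1.1",
    "DMLC_PS_ROOT_PORT": "1234",
    "DMLC_NUM_WORKER": "2",
    "DMLC_NUM_SERVER": "2",
    "DMLC_ENABLE_RDMA": "1",
}
envs = {
    "node1-scheduler": dict(base, DMLC_ROLE="scheduler"),
    # Deliberately wrong port, to show what a mismatch looks like.
    "node2-server": dict(base, DMLC_ROLE="server", DMLC_PS_ROOT_PORT="1235"),
}

for key, values in find_mismatches(envs).items():
    print(key, values)
```

In practice the dictionaries could be filled from `os.environ` on each node. An empty result only means the shared variables agree; it does not prove the scheduler is reachable over RDMA.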
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 193.168.1.135  netmask 255.255.255.0  broadcast 193.168.1.255
        inet6 fe80::ba59:9f03:1b:a952  prefixlen 64  scopeid 0x20
        unspec 20-00-09-07-FE-80-00-00-00-00-00-00-00-00-00-00  txqueuelen 256  (UNSPEC)
        RX packets 4637  bytes 1337112 (1.3 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 130  bytes 7896 (7.8 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
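A quick sanity check on the `ib0` addressing above can be done with Python's standard `ipaddress` module. This is a minimal sketch: the local address and netmask come from the output above, while the helper `on_same_subnet` and the candidate scheduler addresses are hypothetical. Note that the interface uses 193.168.1.x rather than the more common 192.168.1.x, so a one-digit typo in the scheduler address would put it off the subnet.

```python
import ipaddress

# Address and netmask as reported for ib0.
iface = ipaddress.ip_interface("193.168.1.135/255.255.255.0")
print(iface.network)                     # subnet that ib0 belongs to
print(iface.network.broadcast_address)   # should match the reported broadcast


def on_same_subnet(local, netmask, remote):
    """Return True if `remote` lies in the subnet defined by local/netmask."""
    network = ipaddress.ip_interface(f"{local}/{netmask}").network
    return ipaddress.ip_address(remote) in network


# Hypothetical scheduler addresses, for illustration only.
print(on_same_subnet("193.168.1.135", "255.255.255.0", "193.168.1.20"))
print(on_same_subnet("193.168.1.135", "255.255.255.0", "192.168.1.20"))
```

Being on the same subnet is a necessary condition for the simple two-node setup described here, not a guarantee that the RDMA connection will succeed.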
Dual-port       : OFF
Device          : mlx5_0
Number of qps   : 2
Transport type  : IB
Connection type : RC
Using SRQ       : OFF
TX depth        : 128
CQ Moderation   : 100
Mtu             : 4096[B]
Link type       : IB
GID index       : 3
Max inline data : 0[B]
rdma_cm QPs     : OFF
Data ex. method : Ethernet
local address:  LID 0x02 QPN 0x1357 PSN 0xa2869d GID: 254:128:00:00:00:00:00:00:00:00:00:00:00:00:00:00
local address:  LID 0x02 QPN 0x1358 PSN 0xa7c8c3 GID: 254:128:00:00:00:00:00:00:00:00:00:00:00:00:00:00
remote address: LID 0x0f QPN 0x0cac PSN 0xc13105 GID: 254:128:00:00:00:00:00:00:00:00:00:00:00:00:00:00
remote address: LID 0x0f QPN 0x0cad PSN 0xafc70b GID: 254:128:00:00:00:00:00:00:00:00:00:00:00:00:00:00
bytes      #iterations   BW peak[MB/sec]   BW average[MB/sec]   MsgRate[Mpps]
2          1000          6.13              5.84                 3.064301
4          1000          18.42             18.38                4.817242
8          1000          38.79             38.32                5.022091
16         1000          74.01             73.73                4.832060
32         1000          154.81            153.05               5.015286
64         1000          284.84            283.69               4.648054
128        1000          594.79            585.45               4.795980
256        1000          1212.09           1134.74              4.647879
512        1000          2401.47           2320.86              4.753125
1024       1000          4836.90           4773.23              4.887785
2048       1000          9217.28           8673.96              4.441065
4096       1000          11237.80          11229.43             2.874734
8192       1000          11299.42          11298.69             1.446232
16384      1000          11358.14          11355.68             0.726763
32768      1000          7668.49           7667.84              0.245371
65536      1000          11389.69          11389.15             0.182226
131072     1000          7688.10           7687.93              0.061503
262144     1000          11397.56          11397.52             0.045590
524288     1000          11396.04          11395.97             0.022792
1048576    1000          10937.69          10023.54             0.010024
2097152    1000          10940.64          9833.39              0.004917
4194304    1000          11394.83          10556.46             0.002639
8388608    1000          11394.62          10179.88             0.001272
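The bandwidth results above can be scanned programmatically for message sizes that fall well below the plateau. The sketch below parses the five-column rows (only sizes from 4096 bytes upward are copied here) and flags any row whose average bandwidth is under 80% of the best average. The helper names, the 4096-byte cutoff, and the 80% threshold are arbitrary assumptions chosen for illustration; they are not part of the benchmark tool.

```python
# Rows copied from the results above (sizes >= 4096 bytes).
RAW = """
4096 1000 11237.80 11229.43 2.874734
8192 1000 11299.42 11298.69 1.446232
16384 1000 11358.14 11355.68 0.726763
32768 1000 7668.49 7667.84 0.245371
65536 1000 11389.69 11389.15 0.182226
131072 1000 7688.10 7687.93 0.061503
262144 1000 11397.56 11397.52 0.045590
524288 1000 11396.04 11395.97 0.022792
1048576 1000 10937.69 10023.54 0.010024
2097152 1000 10940.64 9833.39 0.004917
4194304 1000 11394.83 10556.46 0.002639
8388608 1000 11394.62 10179.88 0.001272
"""


def parse_rows(text):
    """Parse whitespace-separated results into (bytes, iters, peak, avg, rate) tuples."""
    tokens = text.split()
    if len(tokens) % 5 != 0:
        raise ValueError("expected five columns per row")
    rows = []
    for i in range(0, len(tokens), 5):
        size, iters, peak, avg, rate = tokens[i:i + 5]
        rows.append((int(size), int(iters), float(peak), float(avg), float(rate)))
    return rows


def find_dips(rows, min_size=4096, threshold=0.8):
    """Return message sizes whose average bandwidth is below threshold * best average."""
    large = [r for r in rows if r[0] >= min_size]
    best = max(r[3] for r in large)
    return [r[0] for r in large if r[3] < threshold * best]


print(find_dips(parse_rows(RAW)))
```

With these rows, the 32768-byte and 131072-byte sizes are flagged: they average roughly 7.7 GB/s against a plateau of about 11.4 GB/s. Whether those dips have anything to do with the bind failure is not established by this output.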
mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)
mlx5_2 port 1 ==> ib2 (Up)
mlx5_3 port 1 ==> ib3 (Up)