bytedance / byteps

A high performance and generic framework for distributed DNN training

Check failed: mr happens when RDMA enabled #369

Open yma11 opened 3 years ago

yma11 commented 3 years ago

Describe the bug: "Check failed: mr" occurs on the scheduler when RDMA is enabled.

To Reproduce: We have 2 GPU nodes and 2 CPU nodes, and can run BytePS over TCP/IP successfully. When we try to enable RDMA by following https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#distributed-training-with-rdma, we run into the errors below:

[01:54:54] byteps/server/server.cc:339: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[01:54:54] src/postoffice.cc:20: enable RDMA for networking
[01:54:54] src/./rdma_van.h:40: Shared memory IPC has been disabled
[01:54:54] src/van.cc:389: Bind to role=scheduler, id=1, ip=10.10.10.54, port=9000, is_recovery=0
[01:54:54] src/./rdma_van.h:131: Connecting to Node 1, My_Node=1
[01:54:54] src/./rdma_van.h:801: OnConnect to Node 1 with Transport=RDMA
[01:54:54] src/./rdma_van.h:207: Connect to Node 1 with Transport=RDMA
[01:55:01] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
[01:55:01] src/./rdma_van.h:877: OnConnected to Node 2147483647
[01:55:01] src/van.cc:503: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.10.10.54, port=48128, is_recovery=0 } }. THIS IS NOT DATA MSG!
[01:55:06] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
[01:55:06] src/./rdma_van.h:877: OnConnected to Node 2147483647
[01:55:06] src/van.cc:503: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.10.10.54, port=36018, is_recovery=0 } }. THIS IS NOT DATA MSG!
[01:56:48] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
[01:56:48] src/./rdma_van.h:877: OnConnected to Node 2147483647
[01:56:48] src/van.cc:503: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.10.10.114, port=45347, is_recovery=0 } }. THIS IS NOT DATA MSG!
[01:56:50] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
[01:56:50] src/./rdma_van.h:877: OnConnected to Node 2147483647
[01:56:50] src/van.cc:503: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.10.10.113, port=45124, is_recovery=0 } }. THIS IS NOT DATA MSG!
[01:56:50] src/van.cc:140: assign rank=8 to node role=server, ip=10.10.10.54, port=36018, is_recovery=0
[01:56:50] src/./rdma_van.h:131: Connecting to Node 8, My_Node=1
[01:56:50] src/./rdma_van.h:877: OnConnected to Node 8
[01:56:50] src/./rdma_van.h:207: Connect to Node 8 with Transport=RDMA
[01:56:50] src/van.cc:140: assign rank=10 to node role=server, ip=10.10.10.54, port=48128, is_recovery=0
[01:56:50] src/./rdma_van.h:131: Connecting to Node 10, My_Node=1
[01:56:50] 3rdparty/ps-lite/include/dmlc/logging.h:276: [01:56:50] src/./rdma_transport.h:130: Check failed: mr

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1b98c) [0x7f41769c398c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1bdad) [0x7f41769c3dad]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x40fb8) [0x7f41769e8fb8]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x57dbb) [0x7f41769ffdbb]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7f41760a866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f41793e36db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f417971c88f]

terminate called after throwing an instance of 'dmlc::Error'
what(): [01:56:50] src/./rdma_transport.h:130: Check failed: mr

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1b98c) [0x7f41769c398c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1bdad) [0x7f41769c3dad]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x40fb8) [0x7f41769e8fb8]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x57dbb) [0x7f41769ffdbb]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7f41760a866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f41793e36db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f417971c88f]

Note that we have verified the RDMA network with commands such as ib_write_bw -R on the server and ib_write_bw -R 10.10.10.114 on the client.

Environment (please complete the following information):

Any idea on this? Thanks in advance.

ymjiang commented 3 years ago

What is your version number of ps-lite?

yma11 commented 3 years ago

How do I check the version of ps-lite? I am using the Docker image provided at https://registry-1.docker.io/.

ymjiang commented 3 years ago

The Docker image may not have the latest BytePS code. Could you install with pip3 install byteps==v0.2.5?
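
To see which BytePS build a container actually has before and after reinstalling, something like the following may help. It is only a sketch: it reports the version of the installed byteps Python package (the stack trace above suggests 0.2.0), not the commit of the bundled ps-lite, and the helper name installed_version is made up for illustration.

```python
import sys


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    if sys.version_info >= (3, 8):
        from importlib.metadata import PackageNotFoundError, version
        try:
            return version(package_name)
        except PackageNotFoundError:
            return None
    # Older interpreters (the trace in this issue shows Python 3.6)
    # do not ship importlib.metadata, so fall back to setuptools.
    import pkg_resources
    try:
        return pkg_resources.get_distribution(package_name).version
    except pkg_resources.DistributionNotFound:
        return None


if __name__ == "__main__":
    # Prints "byteps: None" when the package is not installed.
    print("byteps:", installed_version("byteps"))
```

Running it inside the container on every node should show whether the scheduler, servers, and workers all picked up the same release.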

And could you check the result of ulimit -l? Registering a memory region may fail if this value is too small.
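
Since ulimit -l only reflects the shell it is run in, it can also be useful to read the limit from inside the Python process that launches BytePS. The sketch below uses the standard resource module and reports the locked-memory limit in KiB, the unit bash uses for ulimit -l. The function names memlock_limit_kib and describe are illustrative, not part of BytePS.

```python
import resource


def memlock_limit_kib():
    """Return (soft, hard) RLIMIT_MEMLOCK in KiB; None stands for unlimited."""
    def to_kib(value):
        return None if value == resource.RLIM_INFINITY else value // 1024

    # getrlimit reports the limits in bytes.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return to_kib(soft), to_kib(hard)


def describe(limit_kib):
    """Format a limit the way `ulimit -l` would print it."""
    return "unlimited" if limit_kib is None else f"{limit_kib} KiB"


if __name__ == "__main__":
    soft, hard = memlock_limit_kib()
    print("max locked memory (soft):", describe(soft))
    print("max locked memory (hard):", describe(hard))
```

If the soft limit turns out to be small, a common remedy in Docker is to start the container with --ulimit memlock=-1:-1, though whether that resolves this particular failure is an assumption until it is tested.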