I am trying to run 2 workers, 2 servers and 1 scheduler in this scenario:
Node1:
1. start scheduler
2. start server
3. start worker
Node2:
1. start server
2. start worker
I start the processes in this sequence:
Node1 scheduler -> server -> worker -> Node2 server -> worker
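For readers who want to reproduce the layout, the launch order above can be written down as a small sketch. The DMLC_* variable names are the ones used in the BytePS step-by-step tutorial, and DMLC_ENABLE_RDMA=1 matches the deprecation warning in the log. The address, port, and the helper names (role_env, LAUNCH_ORDER) are placeholders made up for this example, not the exact values from my setup.

```python
# Sketch only: per-process environment for the launch order described above.
# ROOT_URI and ROOT_PORT are placeholders; role_env and LAUNCH_ORDER are
# hypothetical helper names, not part of BytePS.

ROOT_URI = "10.0.0.1"  # placeholder: IP of Node1, where the scheduler runs
ROOT_PORT = "1234"     # placeholder port


def role_env(role, worker_id=None):
    """Return the environment variables for one scheduler/server/worker process."""
    env = {
        "DMLC_ROLE": role,
        "DMLC_NUM_WORKER": "2",
        "DMLC_NUM_SERVER": "2",
        "DMLC_PS_ROOT_URI": ROOT_URI,
        "DMLC_PS_ROOT_PORT": ROOT_PORT,
        "DMLC_ENABLE_RDMA": "1",
    }
    if role == "worker":
        # Each worker needs its own ID; the other roles do not set one here.
        env["DMLC_WORKER_ID"] = str(worker_id)
    return env


# Same order as in the description: everything on Node1 first, then Node2.
LAUNCH_ORDER = [
    ("Node1", role_env("scheduler")),
    ("Node1", role_env("server")),
    ("Node1", role_env("worker", worker_id=0)),
    ("Node2", role_env("server")),
    ("Node2", role_env("worker", worker_id=1)),
]

for node, env in LAUNCH_ORDER:
    exports = " ".join(f"{key}={value}" for key, value in sorted(env.items()))
    print(f"{node}: {exports}")
```

Every process shares the same scheduler address and the same worker/server counts; only DMLC_ROLE and, for workers, DMLC_WORKER_ID differ.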
After I start the worker on Node 2, an error occurs. The scheduler shows the following error log:
BytePS launching scheduler
Command: python3 -c 'import byteps.server'
[02:19:48] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[02:19:48] src/postoffice.cc:25: Creating Van: 1
[02:19:48] src/van.cc:84: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[02:19:48] src/./rdma_van.h:44: Shared memory IPC has been disabled
[02:19:48] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[02:19:48] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[02:21:08] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:23:25] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:25:36] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:27:34] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:27:35] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[02:27:35] 3rdparty/ps-lite/include/dmlc/logging.h:276: [02:27:35] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22a8c) [0x7f4970e27a8c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ecd) [0x7f4970e27ecd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x787f0) [0x7f4970e7d7f0]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7991b) [0x7f4970e7e91b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f49705056df]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f49721376db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f497247071f]
terminate called after throwing an instance of 'dmlc::Error'
what(): [02:27:35] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22a8c) [0x7f4970e27a8c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ecd) [0x7f4970e27ecd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x787f0) [0x7f4970e7d7f0]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7991b) [0x7f4970e7e91b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f49705056df]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f49721376db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f497247071f]
Aborted (core dumped)
Traceback (most recent call last):
File "/usr/local/bin/bpslaunch", line 4, in <module>
__import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 658, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 220, in <module>
launch_bps()
File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 216, in launch_bps
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.
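To make the connection timeline in the log easier to read, here is a minimal sketch that extracts the OnConnect events from a few of the lines above. 2147483647 is the largest 32-bit signed integer; my understanding is that ps-lite uses it as a placeholder for a node that has not been assigned an ID yet, but treat that as an assumption. The names LOG, UNASSIGNED_ID and unassigned are made up for this example.

```python
import re

# A few lines copied from the scheduler log above.
LOG = """\
[02:19:48] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[02:19:48] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[02:21:08] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:23:25] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:25:36] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:27:34] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:27:35] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
"""

UNASSIGNED_ID = 2**31 - 1  # assumption: placeholder ID for a not-yet-registered node

# Capture (timestamp, node id) for incoming connections only.
pattern = re.compile(r"\[(\d{2}:\d{2}:\d{2})\] \S+ OnConnect to Node (\d+)")
events = pattern.findall(LOG)

# Timestamps of incoming connections from nodes without an assigned ID.
unassigned = [ts for ts, node in events if int(node) == UNASSIGNED_ID]

print(f"{len(unassigned)} incoming connections from unassigned nodes at: {', '.join(unassigned)}")
```

The number of such connections can be compared against the 2 servers and 2 workers that were started. The ibv_reg_mr failure appears only after the last of them, when the scheduler starts connecting back out.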
Environment (please complete the following information):
I use the PyTorch Dockerfile (byteps/docker/Dockerfile) to build the container.
Additional context
Do you have any suggestions?
My ulimit -l result is unlimited.
I set BYTEPS_RDMA_START_DEPTH=16 and BYTEPS_RDMA_RX_DEPTH=32. It still shows the same error.
My BytePS code is the latest version.
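Since ibv_reg_mr needs to pin memory, it may be worth confirming these settings from inside the same container and shell that launches the failing process, because a limit checked on the host does not necessarily apply inside Docker. Below is a minimal sketch using only the Python standard library; the function names are made up for this example, and it only reports values, it does not change anything.

```python
import os
import resource


def describe_limit(value):
    """Format an rlimit value the way `ulimit -l` would describe it."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def rdma_memory_report(environ=os.environ):
    """Collect the locked-memory limit and the two RDMA depth settings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return {
        "memlock_soft": describe_limit(soft),
        "memlock_hard": describe_limit(hard),
        "BYTEPS_RDMA_START_DEPTH": environ.get("BYTEPS_RDMA_START_DEPTH", "<unset>"),
        "BYTEPS_RDMA_RX_DEPTH": environ.get("BYTEPS_RDMA_RX_DEPTH", "<unset>"),
    }


if __name__ == "__main__":
    for key, value in rdma_memory_report().items():
        print(f"{key}: {value}")
```

If either depth variable prints as <unset>, the export did not reach the process environment, which would explain why lowering the values had no effect.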
Describe the bug
I tried the following scenario with bpslaunch, and an error occurs: run ResNet-50 on 2 nodes, each node with 8 GPUs, no additional CPU servers (https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#:~:text=Distributed%20Training,with%20RDMA).
To Reproduce
Steps to reproduce the behavior: the steps are exactly the same as in the step-by-step tutorial (https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#:~:text=Distributed%20Training,with%20RDMA).
Is the start sequence the problem?