bytedance / byteps

A high performance and generic framework for distributed DNN training

RDMA: Check failed: mr ibv_reg_mr failed: Cannot allocate memory #372

Closed Ruinhuang closed 3 years ago

Ruinhuang commented 3 years ago

Describe the bug: I tried the following scenario with bpslaunch, and an error occurs. Run ResNet-50 on 2 nodes, each node with 8 GPUs, with no additional CPU servers: https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#:~:text=Distributed%20Training,with%20RDMA

To Reproduce: The steps are exactly the same as in the step-by-step tutorial (https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#:~:text=Distributed%20Training,with%20RDMA).

I set up 2 workers, 2 servers, and 1 scheduler in this scenario.
Node 1: 1. start scheduler 2. start server 3. start worker

Node 2: 1. start server 2. start worker
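For readers trying to reproduce this layout, the per-process environment can be sketched roughly as below. This is only an illustration: the `DMLC_*` variable names are the ones I recall from the linked tutorial (please verify them there), the address and port are hypothetical placeholders, `role_env` and `LAUNCH_ORDER` are made-up helper names, and worker-only settings such as the visible GPU list are left out. `DMLC_ENABLE_RDMA=ibverbs` follows the deprecation notice printed in the log.

```python
from typing import Dict, Optional


def role_env(role: str, root_uri: str, root_port: int,
             num_worker: int = 2, num_server: int = 2,
             worker_id: Optional[int] = None) -> Dict[str, str]:
    """Build the environment for one BytePS process (hypothetical helper)."""
    if role not in ("scheduler", "server", "worker"):
        raise ValueError(f"unknown role: {role}")
    env = {
        "DMLC_ROLE": role,
        "DMLC_NUM_WORKER": str(num_worker),
        "DMLC_NUM_SERVER": str(num_server),
        # Address and port of the scheduler; every process uses the same pair.
        "DMLC_PS_ROOT_URI": root_uri,
        "DMLC_PS_ROOT_PORT": str(root_port),
        # The log recommends "ibverbs" over the deprecated value "1".
        "DMLC_ENABLE_RDMA": "ibverbs",
    }
    if role == "worker":
        if worker_id is None:
            raise ValueError("a worker needs a worker_id")
        env["DMLC_WORKER_ID"] = str(worker_id)
    return env


# The five processes in this report: (node, role, worker_id).
LAUNCH_ORDER = [
    ("node1", "scheduler", None),
    ("node1", "server", None),
    ("node1", "worker", 0),
    ("node2", "server", None),
    ("node2", "worker", 1),
]

if __name__ == "__main__":
    for node, role, wid in LAUNCH_ORDER:
        # 10.0.0.1:1234 is a placeholder for the scheduler's RDMA-reachable address.
        env = role_env(role, "10.0.0.1", 1234, worker_id=wid)
        exports = " ".join(f"{k}={v}" for k, v in env.items())
        print(f"[{node}] {exports} bpslaunch")
```

Each printed line is the environment one process would be launched with; it does not start anything itself.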

I start the processes in this order: Node 1 scheduler -> server -> worker, then Node 2 server -> worker. After I start the worker on Node 2, the error occurs, and the following error log is shown on the scheduler:

BytePS launching scheduler
Command: python3 -c 'import byteps.server'

[02:19:48] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[02:19:48] src/postoffice.cc:25: Creating Van: 1
[02:19:48] src/van.cc:84: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[02:19:48] src/./rdma_van.h:44: Shared memory IPC has been disabled
[02:19:48] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[02:19:48] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[02:21:08] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:23:25] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:25:36] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:27:34] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:27:35] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[02:27:35] 3rdparty/ps-lite/include/dmlc/logging.h:276: [02:27:35] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22a8c) [0x7f4970e27a8c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ecd) [0x7f4970e27ecd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x787f0) [0x7f4970e7d7f0]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7991b) [0x7f4970e7e91b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f49705056df]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f49721376db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f497247071f]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [02:27:35] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22a8c) [0x7f4970e27a8c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ecd) [0x7f4970e27ecd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x787f0) [0x7f4970e7d7f0]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7991b) [0x7f4970e7e91b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f49705056df]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f49721376db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f497247071f]

Aborted (core dumped)
Traceback (most recent call last):
  File "/usr/local/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 220, in <module>
    launch_bps()
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 216, in launch_bps
    stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.

Environment (please complete the following information): I used the PyTorch Dockerfile (byteps/docker/Dockerfile) to build the container.

Additional context: Do you have any suggestions?
My ulimit -l result is unlimited.
I set BYTEPS_RDMA_START_DEPTH=16 and BYTEPS_RDMA_RX_DEPTH=32, and it still shows the same error.
My BytePS code version is the latest.
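Since both the locked-memory limit and the two depth variables are read per process, it can help to check what the process itself sees rather than what the interactive shell reports. A minimal sketch, assuming a Linux host; `rdma_limits_report` is a hypothetical helper name, and the default values in the strings are copied from the error message above:

```python
import os
import resource


def rdma_limits_report(environ=None):
    """Report the memlock limit and RDMA depth settings seen by this process."""
    environ = os.environ if environ is None else environ
    # getrlimit returns bytes; `ulimit -l` prints the same limit in KiB.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return {
        "memlock_soft": fmt(soft),
        "memlock_hard": fmt(hard),
        "BYTEPS_RDMA_START_DEPTH": environ.get(
            "BYTEPS_RDMA_START_DEPTH", "unset (default 128)"),
        "BYTEPS_RDMA_RX_DEPTH": environ.get(
            "BYTEPS_RDMA_RX_DEPTH", "unset (default 2048)"),
    }


if __name__ == "__main__":
    for name, value in rdma_limits_report().items():
        print(f"{name}: {value}")
```

Running this inside the same container and shell that launches the server shows whether the limit and the exported values actually reach the process. For example, a shell command written as `export NAME =32` (with a space before `=`) does not set `NAME` to 32; whether that happened here is not known from the report.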

Is the start sequence the problem?

Ruinhuang commented 3 years ago

I solved this issue by setting:

export BYTEPS_RDMA_RX_DEPTH=1024
export BYTEPS_RDMA_START_DEPTH=64

But how should I choose these values? What conditions determine how these parameters should be set?
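No rule is given in this thread, so the following is only one way to explore the settings, not a recommendation from the BytePS authors: start from the defaults named in the error message (128 and 2048) and halve both until registration succeeds. The values that worked above are exactly one halving step. `depth_candidates` is a hypothetical helper name.

```python
def depth_candidates(start_depth=128, rx_depth=2048, steps=3):
    """Return (START_DEPTH, RX_DEPTH) pairs, halving both at each step."""
    candidates = []
    for _ in range(steps + 1):
        candidates.append((start_depth, rx_depth))
        # Keep both values at 1 or above.
        start_depth = max(1, start_depth // 2)
        rx_depth = max(1, rx_depth // 2)
    return candidates


if __name__ == "__main__":
    for start, rx in depth_candidates():
        print(f"export BYTEPS_RDMA_START_DEPTH={start} BYTEPS_RDMA_RX_DEPTH={rx}")
```

Trying the pairs from largest to smallest keeps the queue depths as high as the host allows. The same values should presumably be exported for the scheduler, servers, and workers alike, though the thread does not confirm that.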