Can you try adding the path of libibverbs to $LD_LIBRARY_PATH?
I checked the libibverbs path and noticed something: the warning complains about missing 'xxx-rdmav2.so' files, but my directory only contains 'xxx-rdmav22.so', as shown below:
libbnxt_re-rdmav22.so libcxgb4-rdmav22.so libhns-rdmav22.so libipathverbs-rdmav22.so libmlx5-rdmav22.so libnes-rdmav22.so libqedr-rdmav22.so libvmw_pvrdma-rdmav22.so
libcxgb3-rdmav22.so libhfi1verbs-rdmav22.so libi40iw-rdmav22.so libmlx4-rdmav22.so libmthca-rdmav22.so libocrdma-rdmav22.so librxe-rdmav22.so
Do I need to run ln -s xxx-rdmav22.so xxx-rdmav2.so?
I tried ln -s xxx-rdmav22.so xxx-rdmav2.so, but it had no effect:
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libvmw_pvrdma-rdmav2.so)
libibverbs: Warning: couldn't load driver 'cxgb4': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libcxgb4-rdmav2.so)
libibverbs: Warning: couldn't load driver 'mthca': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libmthca-rdmav2.so)
libibverbs: Warning: couldn't load driver 'hns': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libhns-rdmav2.so)
libibverbs: Warning: couldn't load driver 'mlx4': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libmlx4-rdmav2.so)
libibverbs: Warning: couldn't load driver 'ipathverbs': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libipathverbs-rdmav2.so)
libibverbs: Warning: couldn't load driver 'bnxt_re': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libbnxt_re-rdmav2.so)
libibverbs: Warning: couldn't load driver 'cxgb3': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libcxgb3-rdmav2.so)
libibverbs: Warning: couldn't load driver 'ocrdma': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libocrdma-rdmav2.so)
libibverbs: Warning: couldn't load driver 'i40iw': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libi40iw-rdmav2.so)
libibverbs: Warning: couldn't load driver 'qedr': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libqedr-rdmav2.so)
libibverbs: Warning: couldn't load driver 'nes': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libnes-rdmav2.so)
libibverbs: Warning: couldn't load driver 'hfi1verbs': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libhfi1verbs-rdmav2.so)
[23:09:25] src/./rdma_van.h:815: OnConnect to Node 1 with Transport=RDMA
[23:09:25] src/./rdma_van.h:214: Connect to Node 1 with Transport=RDMA
I discussed this with my colleague; we concluded that the warnings above are normal. The key point is that the server and worker print this error:
3rdparty/ps-lite/include/dmlc/logging.h:276: [19:14:03] src/van.cc:376: Check failed: !ip.empty() failed to get ip
We don't know the reason.
Can you make sure that ib_send_bw works as expected? Just to confirm the drivers are fine.
And please paste the complete commands that you used to launch the server processes.
It looks normal.
server:
# ib_send_bw -d mlx5_0 -F -x 3 -a -q 2
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 2 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 512
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x03b1 PSN 0xff02e2
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:144:13
local address: LID 0000 QPN 0x03b2 PSN 0x4ab024
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:144:13
remote address: LID 0000 QPN 0x034b PSN 0xcd1bc1
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:143:17
remote address: LID 0000 QPN 0x034c PSN 0xf7a117
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:143:17
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
2 2000 0.00 9.64 5.054621
4 2000 0.00 20.06 5.257998
8 2000 0.00 40.19 5.267833
16 2000 0.00 80.22 5.257182
32 2000 0.00 151.48 4.963620
64 2000 0.00 307.79 5.042846
128 2000 0.00 538.72 4.413229
256 2000 0.00 1314.10 5.382569
512 2000 0.00 2562.82 5.248655
1024 2000 0.00 4824.47 4.940260
2048 2000 0.00 10387.11 5.318202
4096 2000 0.00 10844.93 2.776303
8192 2000 0.00 10983.35 1.405868
16384 2000 0.00 10975.69 0.702444
32768 2000 0.00 11045.22 0.353447
65536 2000 0.00 10771.54 0.172345
131072 2000 0.00 11040.74 0.088326
262144 2000 0.00 11046.52 0.044186
524288 2000 0.00 10991.94 0.021984
1048576 2000 0.00 11048.35 0.011048
2097152 2000 0.00 11037.66 0.005519
4194304 2000 0.00 11051.34 0.002763
8388608 2000 0.00 11049.53 0.001381
---------------------------------------------------------------------------------------
client:
# ib_send_bw -d mlx5_0 -F -x 3 -a -q 2 10.137.144.13
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 2 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x034b PSN 0xcd1bc1
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:143:17
local address: LID 0000 QPN 0x034c PSN 0xf7a117
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:143:17
remote address: LID 0000 QPN 0x03b1 PSN 0xff02e2
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:144:13
remote address: LID 0000 QPN 0x03b2 PSN 0x4ab024
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:144:13
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
2 2000 10.07 9.16 4.803076
4 2000 19.81 19.56 5.127040
8 2000 39.36 39.23 5.142463
16 2000 78.72 78.35 5.134476
32 2000 148.77 148.38 4.862031
64 2000 303.99 300.87 4.929454
128 2000 625.58 528.20 4.327051
256 2000 1288.46 1284.67 5.262027
512 2000 2530.48 2501.99 5.124083
1024 2000 4730.67 4720.26 4.833546
2048 2000 9855.57 9839.47 5.037811
4096 2000 10624.45 10615.65 2.717606
8192 2000 10868.54 10857.09 1.389708
16384 2000 10914.73 10912.28 0.698386
32768 2000 11009.99 11009.29 0.352297
65536 2000 10965.50 10751.23 0.172020
131072 2000 11027.72 11027.69 0.088222
262144 2000 11036.82 11036.75 0.044147
524288 2000 11021.59 10983.65 0.021967
1048576 2000 11038.09 11038.08 0.011038
2097152 2000 11035.63 11026.87 0.005513
4194304 2000 11040.46 11040.45 0.002760
8388608 2000 11039.53 11038.55 0.001380
---------------------------------------------------------------------------------------
Here is the output of ifconfig, ibdev2netdev and ibv_devinfo:
ifconfig:
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.137.144.13 netmask 255.255.255.0 broadcast 10.137.144.255
ether ec:0d:9a:ab:54:0a txqueuelen 1000 (Ethernet)
RX packets 83975381 bytes 2792479211029 (2.5 TiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 91259664 bytes 2567129665617 (2.3 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 0.0.0.0
ether 02:42:ab:ca:63:58 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether ec:0d:9a:ab:54:0a txqueuelen 1000 (Ethernet)
RX packets 1956575207 bytes 2914010407720 (2.6 TiB)
RX errors 0 dropped 102252 overruns 0 frame 0
TX packets 1817041401 bytes 2680954427903 (2.4 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1000 (Local Loopback)
RX packets 30334953 bytes 30205847277 (28.1 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 30334953 bytes 30205847277 (28.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ibdev2netdev:
mlx5_0 port 1 ==> eth0 (Up)
mlx5_1 port 1 ==> eth1 (Down)
ibv_devinfo:
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.27.2008
node_guid: ec0d:9a03:00ab:540a
sys_image_guid: ec0d:9a03:00ab:540a
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
Here are my launch scripts for the scheduler and server:
scheduler:
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
# the RDMA interface name of the scheduler
export DMLC_ENABLE_RDMA=1
export DMLC_INTERFACE=eth0
export DMLC_PS_ROOT_URI=10.137.144.13 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
bpslaunch
server:
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
# the RDMA interface name of the scheduler
export DMLC_ENABLE_RDMA=1
export DMLC_INTERFACE=eth0
export DMLC_PS_ROOT_URI=10.137.144.13 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
bpslaunch
Seems that your eth0 does not have an available IP. You can either set DMLC_INTERFACE=br0, or manually add DMLC_NODE_HOST=10.137.144.13.
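For example, one way to adjust the server/worker environment above (just a sketch based on your ifconfig output, where br0 holds 10.137.144.13):
export DMLC_INTERFACE=br0
# or keep DMLC_INTERFACE=eth0 and additionally set
export DMLC_NODE_HOST=10.137.144.13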
Thanks a lot, this problem is solved. But I have met another problem: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. According to https://github.com/bytedance/byteps/issues/282 and https://github.com/bytedance/byteps/issues/216, this bug seems to have been fixed and merged into the master branch. I installed byteps with pip install and the version is 0.2.4. Does this version include that PR?
v0.2.4 does not contain the fix, sorry. We will release v0.2.5 on pypi in a few days.
You can also install from the source code using v0.2.5.
Does that mean the v0.2.5 source has this fix? Can I build v0.2.5 from source directly, without any changes?
I just checked and you still need to change the source code a little bit. The correct process is: pull byteps v0.2.5, change the ps-lite submodule to 7e4800fe, then compile byteps using python3 setup.py install.
Apologies for the inconvenience. We will fix this in https://github.com/bytedance/byteps/pull/316.
OK, thank you very much, I will try it soon.
Hello, based on your suggestion, I checked out commit 7e4800fe, but the problem still occurs.
Here are my build commands:
git clone --recursive --branch v0.2.5 --single-branch --depth 1 https://github.com/bytedance/byteps.git
cd byteps/3rdparty/ps-lite
git checkout 7e4800fe
cd ../../ && source scl_source enable devtoolset-4
BYTEPS_NCCL_LINK=shared BYTEPS_USE_RDMA=1 BYTEPS_WITHOUT_MXNET=1 python3 setup.py install
the error is:
[15:59:19] 3rdparty/ps-lite/include/dmlc/logging.h:276: [15:59:19] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 7 entries:
[bt] (0) /usr/local/lib64/python3.6/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2406b) [0x7ff7385b606b]
[bt] (1) /usr/local/lib64/python3.6/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x249a9) [0x7ff7385b69a9]
[bt] (2) /usr/local/lib64/python3.6/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x70e48) [0x7ff738602e48]
[bt] (3) /usr/local/lib64/python3.6/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x71c9b) [0x7ff738603c9b]
[bt] (4) /usr/lib64/libstdc++.so.6(+0xc8421) [0x7ff737ea9421]
[bt] (5) /usr/lib64/libpthread.so.0(+0x7e65) [0x7ff739f67e65]
[bt] (6) /usr/lib64/libc.so.6(clone+0x6d) [0x7ff73958788d]
terminate called after throwing an instance of 'dmlc::Error'
what(): [15:59:19] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
The output of ulimit -l is 131072.
Have you tried tuning the values of BYTEPS_RDMA_RX_DEPTH and BYTEPS_RDMA_START_DEPTH?
No. Doesn't your PR solve this problem? How do I set the values of BYTEPS_RDMA_RX_DEPTH and BYTEPS_RDMA_START_DEPTH? Do I just export them in the terminal and rerun, or do I need to set these two values and then recompile?
The PR only makes the value configurable.
No need to recompile. Just export the value and then rerun.
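For example (a sketch only; the values below are illustrative, with the defaults taken from the error message above):
export BYTEPS_RDMA_START_DEPTH=64   # default 128
export BYTEPS_RDMA_RX_DEPTH=1024    # default 2048
bpslaunch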
Do I only need to set BYTEPS_RDMA_RX_DEPTH and BYTEPS_RDMA_START_DEPTH on the scheduler, or do the worker and server also need them? What values are appropriate?
When I export BYTEPS_RDMA_RX_DEPTH=1024 and BYTEPS_RDMA_START_DEPTH=64 on the scheduler, the server, and the two workers, the scheduler is normal, but the server reports these errors:
[17:30:56] src/./rdma_van.h:230: Connect to Node 1 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 1 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 11 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 9 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 8 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 9 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 8 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 11 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 1 with Transport=RDMA
[17:31:13] 3rdparty/ps-lite/include/dmlc/logging.h:276: [17:31:13] src/./rdma_utils.h:119: Check failed: mr = ibv_reg_mr(pd_, p, size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE) Failed to register the memory region: Cannot allocate memory, sa.size()=2359296
According to https://github.com/bytedance/byteps/issues/282#issuecomment-669082744, that problem was solved by changing the order in which the scheduler, worker, and server are launched. Can a different launch order really cause the RDMA memory-region registration error? Furthermore, what is the correct launch order?
This is caused by not having enough resources for registering the memory buffers.
Here are a few things to try: (duplicate of https://github.com/bytedance/byteps/issues/216#issuecomment-596891713)
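As a side note (not one of the suggestions from the linked comment), the locked-memory limit is a common constraint for ibv_reg_mr. A sketch for checking and raising it; the docker run flags are an assumption about how the container is launched:
ulimit -l                 # current locked-memory limit in KB; "unlimited" is safest for RDMA
ulimit -l unlimited       # raise it for the current shell (may require privileges / limits.conf)
docker run --ulimit memlock=-1 --cap-add=IPC_LOCK ...   # or raise it when starting the container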
I tried these two ways and the problem is still not solved. According to https://www.rdmamojo.com/2012/09/07/ibv_reg_mr/, another possible cause of a failed MR registration is missing write permission (officially: read-only memory cannot be registered with write permissions, either local or remote). My Docker container runs without root permission, so I don't know whether that is what causes this error.
Can you run this benchmark? https://github.com/bytedance/ps-lite#1-basic-benchmark
If it works, then the problem is not related to the permission.
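In case it helps, here is a rough sketch of a minimal 1-worker/1-server run of that benchmark, reusing the DMLC_* variables from the scripts above; please check the linked README for the exact variables and values ps-lite expects:
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_ENABLE_RDMA=1
export DMLC_PS_ROOT_URI=10.137.144.13   # the scheduler IP
export DMLC_PS_ROOT_PORT=1234           # the scheduler port
export DMLC_ROLE=scheduler              # run again with DMLC_ROLE=server and DMLC_ROLE=worker for the other two processes
./tests/test_benchmark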
the output of tests/test_benchmark:
[15:38:55] src/postoffice.cc:25: Creating Van: ibverbs
[15:38:55] src/./rdma_van.h:44: Shared memory IPC has been disabled
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
[15:38:56] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[15:38:56] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[15:38:56] src/./rdma_van.h:806: OnConnect to Node 8 with Transport=RDMA
[15:38:56] src/./rdma_van.h:806: OnConnect to Node 9 with Transport=RDMA
[15:38:56] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[15:38:56] src/./rdma_van.h:234: Connect to Node 8 with Transport=RDMA
[15:38:56] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[15:38:56] tests/test_benchmark.cc:177: 1 servers in total
[15:38:56] tests/test_benchmark.cc:111: ========= PUSH_PULL mode =========
[15:38:56] tests/test_benchmark.cc:112: ========= msg_size=1024000 bytes =========
[15:38:56] tests/test_benchmark.cc:164: Application goodput: 73.2216 Gbps. count = 10
the output of tests/test_ipc_benchmark:
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 87.1557 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.3595 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.4327 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.291 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.8608 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.6724 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.9319 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.9671 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.0182 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.1639 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.991 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.6009 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.349 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.1734 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 85.3423 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.2997 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 81.7764 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 83.8365 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 83.5674 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 83.6138 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 83.904 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 83.3367 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 82.7374 Gbps
It seems to work well.
Hello, the error above was solved by increasing the value of ulimit -l. However, I have met the next problem, which is similar to https://github.com/bytedance/byteps/issues/282#issuecomment-669636652. Following that solution, I checked my PFC config and confirmed it is enabled, so I don't know why my performance is unstable. Here is my ipc-benchmark output with LOG_DURATION=100:
[18:27:09] src/./rdma_van.h:234: Connect to Node 11 with Transport=RDMA
[18:27:09] src/./rdma_van.h:234: Connect to Node 10 with Transport=RDMA
[18:27:09] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[18:27:09] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[18:27:09] tests/test_ipc_benchmark.cc:174: 2 servers in total
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=0, name=BytePS_ShM_0
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=65536, name=BytePS_ShM_65536
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=131072, name=BytePS_ShM_131072
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=196608, name=BytePS_ShM_196608
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=262144, name=BytePS_ShM_262144
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=327680, name=BytePS_ShM_327680
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=393216, name=BytePS_ShM_393216
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=458752, name=BytePS_ShM_458752
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=524288, name=BytePS_ShM_524288
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=589824, name=BytePS_ShM_589824
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=655360, name=BytePS_ShM_655360
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=720896, name=BytePS_ShM_720896
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=786432, name=BytePS_ShM_786432
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=851968, name=BytePS_ShM_851968
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=917504, name=BytePS_ShM_917504
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=983040, name=BytePS_ShM_983040
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=1048576, name=BytePS_ShM_1048576
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=1114112, name=BytePS_ShM_1114112
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=1179648, name=BytePS_ShM_1179648
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=1245184, name=BytePS_ShM_1245184
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 81.3902 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 82.7188 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 106.109 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 155.804 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 155.168 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 156.249 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 155.543 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 154.531 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 154.711 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 140.62 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 113.722 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 125.809 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 142.547 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 153.604 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 155.389 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 155.288 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 153.965 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 148.493 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 153.369 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 150.847 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 130.986 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 124.084 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 138.677 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 142.156 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 135.113 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 153.085 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 154.343 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 153.031 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 134.72 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 121.617 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 136.824 Gbps
[18:27:14] tests/test_ipc_benchmark.cc:136: Application goodput: 118.423 Gbps
[18:27:14] tests/test_ipc_benchmark.cc:136: Application goodput: 130.066 Gbps
[18:27:14] tests/test_ipc_benchmark.cc:136: Application goodput: 149.571 Gbps
[18:27:14] tests/test_ipc_benchmark.cc:136: Application goodput: 155.2 Gbps
[18:27:14] tests/test_ipc_benchmark.cc:136: Application goodput: 153.134 Gbps
[18:27:18] tests/test_ipc_benchmark.cc:136: Application goodput: 4.19001 Gbps
[18:27:18] tests/test_ipc_benchmark.cc:136: Application goodput: 82.6032 Gbps
[18:27:18] tests/test_ipc_benchmark.cc:136: Application goodput: 82.622 Gbps
[18:27:18] tests/test_ipc_benchmark.cc:136: Application goodput: 83.2497 Gbps
[18:27:22] tests/test_ipc_benchmark.cc:136: Application goodput: 4.37797 Gbps
[18:27:22] tests/test_ipc_benchmark.cc:136: Application goodput: 80.2225 Gbps
[18:27:26] tests/test_ipc_benchmark.cc:136: Application goodput: 4.0499 Gbps
[18:27:27] tests/test_ipc_benchmark.cc:136: Application goodput: 81.4794 Gbps
I checked my PFC config and confirmed it is enabled
~How did you confirm it? From the log of ib_send_bw you posted above, the bandwidth looks quite low (10MB/s). I am confused by the results since they are inconsistent with your test_ipc_benchmark.~
(edited due to misread)
PS: You can use the test_benchmark to test the 1v1 RDMA performance.
You misread the ib_send_bw bandwidth unit. The performance is expected.
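(For reference: the ib_send_bw tables above show about 11,000 MB/sec at large message sizes, i.e. roughly 11 GB/s, on the order of 88 Gbps, which is consistent with the test_benchmark and test_ipc_benchmark goodput above, not 10 MB/s.)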
@wuyujiji You can check the counters in this folder: /sys/class/infiniband/mlx5_1/ports/1/hw_counters/ (or change mlx5_1 to another IB device according to your setup). out_of_sequence, rp_cnp_handled, np_ecn_marked_roce_packets, etc. should give you a good understanding of the network-level details. Those counters should not grow when you run 1-to-1 traffic.
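A quick way to snapshot those counters before and after a run (a sketch; adjust the device and port to your setup):
cd /sys/class/infiniband/mlx5_0/ports/1/hw_counters
for c in out_of_sequence rp_cnp_handled np_ecn_marked_roce_packets np_cnp_sent; do echo "$c = $(cat $c)"; done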
@bobzhuyb Hi, I am not familiar with RDMA. When I test 1-to-1 traffic (test_benchmark.cc), the program finishes quickly; the output is:
[10:50:51] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[10:50:51] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[10:50:51] src/./rdma_van.h:234: Connect to Node 8 with Transport=RDMA
[10:50:51] src/./rdma_van.h:806: OnConnect to Node 9 with Transport=RDMA
[10:50:51] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[10:50:51] src/./rdma_van.h:806: OnConnect to Node 8 with Transport=RDMA
[10:50:51] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[10:50:51] tests/test_benchmark.cc:177: 1 servers in total
[10:50:51] tests/test_benchmark.cc:111: ========= PUSH_PULL mode =========
[10:50:51] tests/test_benchmark.cc:112: ========= msg_size=1024000 bytes =========
[10:50:51] tests/test_benchmark.cc:164: Application goodput: 76.5938 Gbps. count = 10
When checking out_of_sequence, rp_cnp_handled, np_ecn_marked_roce_packets, and np_cnp_sent, none of the values increased. I don't know whether that is simply because the program runs too briefly for them to grow.
In addition, when I run test_ipc_benchmark.cc for about five minutes, among out_of_sequence, rp_cnp_handled, np_ecn_marked_roce_packets, and np_cnp_sent, only out_of_sequence and rp_cnp_handled increase.
I did another experiment: when reducing to 1 worker and 1 server on one machine and running test_ipc_benchmark.cc, out_of_sequence and rp_cnp_handled never change.
out_of_sequence growing means there is packet loss, so probably PFC is not enabled. You should ask your system admin about the PFC configuration. You may be able to check the configuration with mlnx_qos -i eth0 (or another interface).
My system admin checked and says the PFC config is enabled.
The output of mlnx_qos -i eth0 is:
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
prio:0 dscp:07,06,05,04,03,02,01,00,
prio:1 dscp:15,14,13,12,11,10,09,08,
prio:2 dscp:23,22,21,20,19,18,17,16,
prio:3 dscp:31,30,29,28,27,26,25,24,
prio:4 dscp:39,38,37,36,35,34,33,32,
prio:5 dscp:47,46,45,44,43,42,41,40,
prio:6 dscp:55,54,53,52,51,50,49,48,
prio:7 dscp:63,62,61,60,59,58,57,56,
Receive buffer size (bytes): 130944,130944,0,0,0,0,0,0,
Cable len: 7
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 1 0 0 0 0
buffer 0 0 0 1 0 0 0 0
tc: 0 ratelimit: unlimited, tsa: vendor
priority: 1
tc: 1 ratelimit: unlimited, tsa: vendor
priority: 0
tc: 2 ratelimit: unlimited, tsa: vendor
priority: 2
tc: 3 ratelimit: unlimited, tsa: vendor
priority: 3
tc: 4 ratelimit: unlimited, tsa: vendor
priority: 4
tc: 5 ratelimit: unlimited, tsa: vendor
priority: 5
tc: 6 ratelimit: unlimited, tsa: vendor
priority: 6
tc: 7 ratelimit: unlimited, tsa: vendor
priority: 7
I am sorry, I don't know how to tell from this output whether PFC is enabled. Could you please help me check it? Thanks a lot!
@bobzhuyb @ymjiang Hello, do you have reference results for the random-data experiment in byteps/example/pytorch/benchmark_byteps.py? I want to compare my results against them.
Describe the bug
Excuse me, based on https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md, when I run distributed training with RDMA, the scheduler prints the libibverbs warnings shown at the top of this thread, and then the server and worker print the "Check failed: !ip.empty() failed to get ip" error.
Could you please help me? Thanks a lot!