bytedance / byteps

A high performance and generic framework for distributed DNN training

run distributed training with RDMA reports the libibverbs warning #313

Open wuyujiji opened 4 years ago

wuyujiji commented 4 years ago

Describe the bug: Based on https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md, when I run distributed training with RDMA, the scheduler prints the following warnings:

BytePS launching scheduler
[19:05:14] byteps/server/server.cc:339: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[19:05:14] src/postoffice.cc:20: enable RDMA for networking
[19:05:14] src/./rdma_van.h:40: Shared memory IPC has been disabled
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
[19:05:15] src/./rdma_van.h:815: OnConnect to Node 1 with Transport=RDMA
[19:05:15] src/./rdma_van.h:214: Connect to Node 1 with Transport=RDMA

Then the server and worker print this error:

BytePS launching server
[19:14:03] byteps/server/server.cc:339: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[19:14:03] src/postoffice.cc:20: enable RDMA for networking
[19:14:03] src/./rdma_van.h:40: Shared memory IPC has been disabled
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
[19:14:03] 3rdparty/ps-lite/include/dmlc/logging.h:276: [19:14:03] src/van.cc:376: Check failed: !ip.empty() failed to get ip

Could you please help me? Thanks a lot!

ymjiang commented 4 years ago

Can you try adding the path of libibverbs to $LD_LIBRARY_PATH?
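
For example (just a sketch; adjust the directory to wherever your rdmav2 driver libraries actually live, e.g. /usr/lib64/libibverbs as shown later in this thread):

export LD_LIBRARY_PATH=/usr/lib64/libibverbs:$LD_LIBRARY_PATH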

wuyujiji commented 4 years ago

I checked the libibverbs path and noticed something: the warnings complain about missing 'xxx-rdmav2.so' files, but the files on my system are named 'xxx-rdmav22.so', as shown below:

libbnxt_re-rdmav22.so  libcxgb4-rdmav22.so      libhns-rdmav22.so    libipathverbs-rdmav22.so  libmlx5-rdmav22.so   libnes-rdmav22.so     libqedr-rdmav22.so  libvmw_pvrdma-rdmav22.so
libcxgb3-rdmav22.so    libhfi1verbs-rdmav22.so  libi40iw-rdmav22.so  libmlx4-rdmav22.so        libmthca-rdmav22.so  libocrdma-rdmav22.so  librxe-rdmav22.so

Do I need to ln -s xxx-rdmav22.so xxx-rdmav2.so?

wuyujiji commented 4 years ago

I tried ln -s xxx-rdmav22.so xxx-rdmav2.so, but it had no effect:

libibverbs: Warning: couldn't load driver 'vmw_pvrdma': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libvmw_pvrdma-rdmav2.so)
libibverbs: Warning: couldn't load driver 'cxgb4': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libcxgb4-rdmav2.so)
libibverbs: Warning: couldn't load driver 'mthca': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libmthca-rdmav2.so)
libibverbs: Warning: couldn't load driver 'hns': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libhns-rdmav2.so)
libibverbs: Warning: couldn't load driver 'mlx4': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libmlx4-rdmav2.so)
libibverbs: Warning: couldn't load driver 'ipathverbs': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libipathverbs-rdmav2.so)
libibverbs: Warning: couldn't load driver 'bnxt_re': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libbnxt_re-rdmav2.so)
libibverbs: Warning: couldn't load driver 'cxgb3': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libcxgb3-rdmav2.so)
libibverbs: Warning: couldn't load driver 'ocrdma': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libocrdma-rdmav2.so)
libibverbs: Warning: couldn't load driver 'i40iw': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libi40iw-rdmav2.so)
libibverbs: Warning: couldn't load driver 'qedr': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libqedr-rdmav2.so)
libibverbs: Warning: couldn't load driver 'nes': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libnes-rdmav2.so)
libibverbs: Warning: couldn't load driver 'hfi1verbs': /usr/local/nvidia/cpu_lib/libibverbs.so.1: version `IBVERBS_PRIVATE_22' not found (required by /usr/lib64/libibverbs/libhfi1verbs-rdmav2.so)
[23:09:25] src/./rdma_van.h:815: OnConnect to Node 1 with Transport=RDMA
[23:09:25] src/./rdma_van.h:214: Connect to Node 1 with Transport=RDMA
wuyujiji commented 4 years ago

I discussed this with my colleague, and we concluded that the libibverbs warnings are normal. The real problem is that the server and worker print this error, and we don't know the reason: 3rdparty/ps-lite/include/dmlc/logging.h:276: [19:14:03] src/van.cc:376: Check failed: !ip.empty() failed to get ip

ymjiang commented 4 years ago

I discussed this with my colleague, and we concluded that the libibverbs warnings are normal

Can you make sure that ib_send_bw works as expected? Just to confirm the drivers are fine.

And please paste the complete commands that you used to launch the server processes.

wuyujiji commented 4 years ago

It looks normal.

server:

# ib_send_bw -d mlx5_0 -F -x 3 -a -q 2

libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 2            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 RX depth        : 512
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x03b1 PSN 0xff02e2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:144:13
 local address: LID 0000 QPN 0x03b2 PSN 0x4ab024
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:144:13
 remote address: LID 0000 QPN 0x034b PSN 0xcd1bc1
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:143:17
 remote address: LID 0000 QPN 0x034c PSN 0xf7a117
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:143:17
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          2000             0.00               9.64               5.054621
 4          2000             0.00               20.06              5.257998
 8          2000             0.00               40.19              5.267833
 16         2000             0.00               80.22              5.257182
 32         2000             0.00               151.48             4.963620
 64         2000             0.00               307.79             5.042846
 128        2000             0.00               538.72             4.413229
 256        2000             0.00               1314.10            5.382569
 512        2000             0.00               2562.82            5.248655
 1024       2000             0.00               4824.47            4.940260
 2048       2000             0.00               10387.11                   5.318202
 4096       2000             0.00               10844.93                   2.776303
 8192       2000             0.00               10983.35                   1.405868
 16384      2000             0.00               10975.69                   0.702444
 32768      2000             0.00               11045.22                   0.353447
 65536      2000             0.00               10771.54                   0.172345
 131072     2000             0.00               11040.74                   0.088326
 262144     2000             0.00               11046.52                   0.044186
 524288     2000             0.00               10991.94                   0.021984
 1048576    2000             0.00               11048.35                   0.011048
 2097152    2000             0.00               11037.66                   0.005519
 4194304    2000             0.00               11051.34                   0.002763
 8388608    2000             0.00               11049.53                   0.001381
---------------------------------------------------------------------------------------

client:

# ib_send_bw -d mlx5_0 -F -x 3 -a -q 2 10.137.144.13

libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 2            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x034b PSN 0xcd1bc1
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:143:17
 local address: LID 0000 QPN 0x034c PSN 0xf7a117
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:143:17
 remote address: LID 0000 QPN 0x03b1 PSN 0xff02e2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:144:13
 remote address: LID 0000 QPN 0x03b2 PSN 0x4ab024
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:137:144:13
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          2000             10.07              9.16               4.803076
 4          2000             19.81              19.56              5.127040
 8          2000             39.36              39.23              5.142463
 16         2000             78.72              78.35              5.134476
 32         2000             148.77             148.38             4.862031
 64         2000             303.99             300.87             4.929454
 128        2000             625.58             528.20             4.327051
 256        2000             1288.46            1284.67            5.262027
 512        2000             2530.48            2501.99            5.124083
 1024       2000             4730.67            4720.26            4.833546
 2048       2000             9855.57            9839.47            5.037811
 4096       2000             10624.45            10615.65                  2.717606
 8192       2000             10868.54            10857.09                  1.389708
 16384      2000             10914.73            10912.28                  0.698386
 32768      2000             11009.99            11009.29                  0.352297
 65536      2000             10965.50            10751.23                  0.172020
 131072     2000             11027.72            11027.69                  0.088222
 262144     2000             11036.82            11036.75                  0.044147
 524288     2000             11021.59            10983.65                  0.021967
 1048576    2000             11038.09            11038.08                  0.011038
 2097152    2000             11035.63            11026.87                  0.005513
 4194304    2000             11040.46            11040.45                  0.002760
 8388608    2000             11039.53            11038.55                  0.001380
---------------------------------------------------------------------------------------
wuyujiji commented 4 years ago

Here is the output of ifconfig, ibdev2netdev, and ibv_devinfo:

ifconfig:

br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.137.144.13  netmask 255.255.255.0  broadcast 10.137.144.255
        ether ec:0d:9a:ab:54:0a  txqueuelen 1000  (Ethernet)
        RX packets 83975381  bytes 2792479211029 (2.5 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 91259664  bytes 2567129665617 (2.3 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 0.0.0.0
        ether 02:42:ab:ca:63:58  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether ec:0d:9a:ab:54:0a  txqueuelen 1000  (Ethernet)
        RX packets 1956575207  bytes 2914010407720 (2.6 TiB)
        RX errors 0  dropped 102252  overruns 0  frame 0
        TX packets 1817041401  bytes 2680954427903 (2.4 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 30334953  bytes 30205847277 (28.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 30334953  bytes 30205847277 (28.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ibdev2netdev:

mlx5_0 port 1 ==> eth0 (Up)
mlx5_1 port 1 ==> eth1 (Down)

ibv_devinfo:

libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.27.2008
        node_guid:                      ec0d:9a03:00ab:540a
        sys_image_guid:                 ec0d:9a03:00ab:540a
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000012
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
wuyujiji commented 4 years ago

Here are my launch scripts for the scheduler and server.

scheduler:

export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1

# the RDMA interface name of the scheduler
export DMLC_ENABLE_RDMA=1
export DMLC_INTERFACE=eth0

export DMLC_PS_ROOT_URI=10.137.144.13 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port

bpslaunch

server:

export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1

# the RDMA interface name of the scheduler
export DMLC_ENABLE_RDMA=1
export DMLC_INTERFACE=eth0

export DMLC_PS_ROOT_URI=10.137.144.13 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port

bpslaunch
ymjiang commented 4 years ago

Seems that your eth0 does not have an available IP. You can either set DMLC_INTERFACE=br0, or manually add DMLC_NODE_HOST=10.137.144.13.
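
For example, in the launch scripts above (either line alone should be enough; 10.137.144.13 is the br0 address from the ifconfig output, so on other nodes it would presumably be their own address):

export DMLC_INTERFACE=br0
# or keep DMLC_INTERFACE=eth0 and set the node IP explicitly:
export DMLC_NODE_HOST=10.137.144.13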

wuyujiji commented 4 years ago

Thanks a lot, this problem is solved. But I hit another problem: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. According to https://github.com/bytedance/byteps/issues/282 and https://github.com/bytedance/byteps/issues/216, this bug seems to have been fixed and merged into the master branch. I installed byteps with pip install and the version is 0.2.4. Does this version include that fix?

ymjiang commented 4 years ago

v0.2.4 does not contain the fix, sorry. We will release v0.2.5 on PyPI in a few days.

You can also install from the source code using v0.2.5.

wuyujiji commented 4 years ago

Does that mean the v0.2.5 source has this issue fixed? Can I build v0.2.5 directly from source without any changes?

ymjiang commented 4 years ago

I just checked and you still need to change the source code a little bit. The correct process is: Pull byteps v0.2.5, and change the ps-lite submodule to 7e4800fe, then compile byteps using python3 setup.py install.

Apologies for the inconvenience. We will fix this in https://github.com/bytedance/byteps/pull/316.

wuyujiji commented 4 years ago

OK, thank you very much, I will try it soon.

wuyujiji commented 4 years ago

I just checked and you still need to change the source code a little bit. The correct process is: Pull byteps v0.2.5, and change the ps-lite submodule to 7e4800fe, then compile byteps using python3 setup.py install.

Apologies for the inconvenience. We will fix this in #316.

Hello, based on your suggestion, I checked out commit 7e4800fe, but the problem still occurs.

Here is my build command:

git clone --recursive --branch v0.2.5 --single-branch --depth 1 https://github.com/bytedance/byteps.git
cd byteps/3rdparty/ps-lite
git checkout 7e4800fe
cd ../../ && source scl_source enable devtoolset-4
BYTEPS_NCCL_LINK=shared BYTEPS_USE_RDMA=1 BYTEPS_WITHOUT_MXNET=1 python3 setup.py install

the error is:

[15:59:19] 3rdparty/ps-lite/include/dmlc/logging.h:276: [15:59:19] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib64/python3.6/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2406b) [0x7ff7385b606b]
[bt] (1) /usr/local/lib64/python3.6/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x249a9) [0x7ff7385b69a9]
[bt] (2) /usr/local/lib64/python3.6/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x70e48) [0x7ff738602e48]
[bt] (3) /usr/local/lib64/python3.6/site-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x71c9b) [0x7ff738603c9b]
[bt] (4) /usr/lib64/libstdc++.so.6(+0xc8421) [0x7ff737ea9421]
[bt] (5) /usr/lib64/libpthread.so.0(+0x7e65) [0x7ff739f67e65]
[bt] (6) /usr/lib64/libc.so.6(clone+0x6d) [0x7ff73958788d]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [15:59:19] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
wuyujiji commented 4 years ago

The output of ulimit -l is 131072.
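
For reference, a rough sketch of how the locked-memory limit can be raised (assuming a Docker setup; ibv_reg_mr pins memory, so a low memlock limit can produce "Cannot allocate memory"):

# inside the container (or on the host), if permitted:
ulimit -l unlimited
# or when starting the container:
docker run --ulimit memlock=-1:-1 ...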

ymjiang commented 4 years ago

Have you tried tuning the value of BYTEPS_RDMA_RX_DEPTH and BYTEPS_RDMA_START_DEPTH?

wuyujiji commented 4 years ago

No. Doesn't your PR solve this problem? How do I set the values of BYTEPS_RDMA_RX_DEPTH and BYTEPS_RDMA_START_DEPTH: by exporting them in the terminal and rerunning, or by setting these two values and recompiling?

ymjiang commented 4 years ago

The PR only makes the value configurable.

By exporting them in the terminal and rerunning, or by setting these two values and recompiling?

No need to recompile. Just export the value and then rerun.
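
For example (the values below are only illustrative, halving/quartering the defaults printed in the error message):

export BYTEPS_RDMA_START_DEPTH=64    # default 128
export BYTEPS_RDMA_RX_DEPTH=1024     # default 2048
bpslaunch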

wuyujiji commented 4 years ago

Do I only need to set BYTEPS_RDMA_RX_DEPTH and BYTEPS_RDMA_START_DEPTH on the scheduler, or do the worker and server need them as well? What values are appropriate?

wuyujiji commented 4 years ago

When I export BYTEPS_RDMA_RX_DEPTH=1024 and BYTEPS_RDMA_START_DEPTH=64 on the scheduler, the server, and the two workers, the scheduler runs normally, but the server reports this error:

[17:30:56] src/./rdma_van.h:230: Connect to Node 1 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 1 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 11 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 9 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 8 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 9 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 8 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 11 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 1 with Transport=RDMA
[17:31:13] 3rdparty/ps-lite/include/dmlc/logging.h:276: [17:31:13] src/./rdma_utils.h:119: Check failed: mr = ibv_reg_mr(pd_, p, size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE) Failed to register the memory region: Cannot allocate memory, sa.size()=2359296
wuyujiji commented 4 years ago

According to https://github.com/bytedance/byteps/issues/282#issuecomment-669082744, that problem was solved by changing the order in which the scheduler, worker, and server are launched. Can a different launch order really cause the RDMA memory-region registration error? Furthermore, what is the correct launch order?

ymjiang commented 4 years ago

When I export BYTEPS_RDMA_RX_DEPTH=1024 and BYTEPS_RDMA_START_DEPTH=64 on the scheduler, the server, and the two workers, the scheduler runs normally, but the server reports this error:

[17:30:56] src/./rdma_van.h:230: Connect to Node 1 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 1 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 11 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 9 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 8 with Transport=RDMA
[17:31:12] src/./rdma_van.h:831: OnConnect to Node 9 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 8 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 11 with Transport=RDMA
[17:31:12] src/./rdma_van.h:230: Connect to Node 1 with Transport=RDMA
[17:31:13] 3rdparty/ps-lite/include/dmlc/logging.h:276: [17:31:13] src/./rdma_utils.h:119: Check failed: mr = ibv_reg_mr(pd_, p, size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE) Failed to register the memory region: Cannot allocate memory, sa.size()=2359296

This is caused by not having enough resources for registering the memory buffers.

Here are a few things to try: (duplicate of https://github.com/bytedance/byteps/issues/216#issuecomment-596891713)

wuyujiji commented 4 years ago

I tried these two approaches and the problem is still not solved. According to https://www.rdmamojo.com/2012/09/07/ibv_reg_mr/, another possible cause of MR registration failure is missing write permission (officially: read-only memory cannot be registered with write permissions, either local or remote). My Docker container is run without root permission, so I don't know whether the error is caused by this write-permission issue.

ymjiang commented 4 years ago

I tried these two approaches and the problem is still not solved. According to https://www.rdmamojo.com/2012/09/07/ibv_reg_mr/, another possible cause of MR registration failure is missing write permission (officially: read-only memory cannot be registered with write permissions, either local or remote). My Docker container is run without root permission, so I don't know whether the error is caused by this write-permission issue.

Can you run this benchmark? https://github.com/bytedance/ps-lite#1-basic-benchmark

If it works, then the problem is not related to the permission.

wuyujiji commented 4 years ago

The output of tests/test_benchmark:

[15:38:55] src/postoffice.cc:25: Creating Van: ibverbs
[15:38:55] src/./rdma_van.h:44: Shared memory IPC has been disabled
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
[15:38:56] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[15:38:56] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[15:38:56] src/./rdma_van.h:806: OnConnect to Node 8 with Transport=RDMA
[15:38:56] src/./rdma_van.h:806: OnConnect to Node 9 with Transport=RDMA
[15:38:56] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[15:38:56] src/./rdma_van.h:234: Connect to Node 8 with Transport=RDMA
[15:38:56] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[15:38:56] tests/test_benchmark.cc:177: 1 servers in total
[15:38:56] tests/test_benchmark.cc:111: ========= PUSH_PULL mode =========
[15:38:56] tests/test_benchmark.cc:112: ========= msg_size=1024000 bytes =========
[15:38:56] tests/test_benchmark.cc:164: Application goodput: 73.2216 Gbps. count = 10

The output of tests/test_ipc_benchmark:

[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 87.1557 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.3595 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.4327 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.291 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.8608 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.6724 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.9319 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.9671 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.0182 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.1639 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 88.991 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.6009 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.349 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.1734 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 85.3423 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 89.2997 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 81.7764 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 83.8365 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 83.5674 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 83.6138 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 83.904 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 83.3367 Gbps
[15:44:30] tests/test_ipc_benchmark.cc:136: Application goodput: 82.7374 Gbps

It seems to work well.

wuyujiji commented 4 years ago

Hello, the above error was solved by increasing the value of ulimit -l. However, I hit the next problem, which is similar to https://github.com/bytedance/byteps/issues/282#issuecomment-669636652. Following that solution, I checked my PFC config and confirmed it is enabled, so I don't know why my performance is unstable. Here is my ipc-benchmark output with LOG_DURATION=100:

[18:27:09] src/./rdma_van.h:234: Connect to Node 11 with Transport=RDMA
[18:27:09] src/./rdma_van.h:234: Connect to Node 10 with Transport=RDMA
[18:27:09] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[18:27:09] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[18:27:09] tests/test_ipc_benchmark.cc:174: 2 servers in total
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=0, name=BytePS_ShM_0
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=65536, name=BytePS_ShM_65536
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=131072, name=BytePS_ShM_131072
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=196608, name=BytePS_ShM_196608
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=262144, name=BytePS_ShM_262144
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=327680, name=BytePS_ShM_327680
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=393216, name=BytePS_ShM_393216
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=458752, name=BytePS_ShM_458752
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=524288, name=BytePS_ShM_524288
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=589824, name=BytePS_ShM_589824
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=655360, name=BytePS_ShM_655360
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=720896, name=BytePS_ShM_720896
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=786432, name=BytePS_ShM_786432
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=851968, name=BytePS_ShM_851968
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=917504, name=BytePS_ShM_917504
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=983040, name=BytePS_ShM_983040
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=1048576, name=BytePS_ShM_1048576
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=1114112, name=BytePS_ShM_1114112
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=1179648, name=BytePS_ShM_1179648
[18:27:09] tests/test_ipc_benchmark.cc:34: initialized share memory size=1024000 for key=1245184, name=BytePS_ShM_1245184
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 81.3902 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 82.7188 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 106.109 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 155.804 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 155.168 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 156.249 Gbps
[18:27:10] tests/test_ipc_benchmark.cc:136: Application goodput: 155.543 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 154.531 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 154.711 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 140.62 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 113.722 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 125.809 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 142.547 Gbps
[18:27:11] tests/test_ipc_benchmark.cc:136: Application goodput: 153.604 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 155.389 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 155.288 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 153.965 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 148.493 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 153.369 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 150.847 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 130.986 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 124.084 Gbps
[18:27:12] tests/test_ipc_benchmark.cc:136: Application goodput: 138.677 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 142.156 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 135.113 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 153.085 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 154.343 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 153.031 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 134.72 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 121.617 Gbps
[18:27:13] tests/test_ipc_benchmark.cc:136: Application goodput: 136.824 Gbps
[18:27:14] tests/test_ipc_benchmark.cc:136: Application goodput: 118.423 Gbps
[18:27:14] tests/test_ipc_benchmark.cc:136: Application goodput: 130.066 Gbps
[18:27:14] tests/test_ipc_benchmark.cc:136: Application goodput: 149.571 Gbps
[18:27:14] tests/test_ipc_benchmark.cc:136: Application goodput: 155.2 Gbps
[18:27:14] tests/test_ipc_benchmark.cc:136: Application goodput: 153.134 Gbps
[18:27:18] tests/test_ipc_benchmark.cc:136: Application goodput: 4.19001 Gbps
[18:27:18] tests/test_ipc_benchmark.cc:136: Application goodput: 82.6032 Gbps
[18:27:18] tests/test_ipc_benchmark.cc:136: Application goodput: 82.622 Gbps
[18:27:18] tests/test_ipc_benchmark.cc:136: Application goodput: 83.2497 Gbps
[18:27:22] tests/test_ipc_benchmark.cc:136: Application goodput: 4.37797 Gbps
[18:27:22] tests/test_ipc_benchmark.cc:136: Application goodput: 80.2225 Gbps
[18:27:26] tests/test_ipc_benchmark.cc:136: Application goodput: 4.0499 Gbps
[18:27:27] tests/test_ipc_benchmark.cc:136: Application goodput: 81.4794 Gbps
ymjiang commented 4 years ago

I checked my PFC config and confirmed it is enabled

~How did you confirm it? From the log of ib_send_bw you posted above, the bandwidth looks quite low (10MB/s). I am confused by the results since they are inconsistent with your test_ipc_benchmark.~

(edited due to misread)

PS: You can use the test_benchmark to test the 1v1 RDMA performance.

bobzhuyb commented 4 years ago

I checked my PFC config and confirmed it is enabled

How did you confirm it? From the log of ib_send_bw you posted above, the bandwidth looks quite low (~10MB/s). I am confused by the results since they are inconsistent with your test_ipc_benchmark.

PS: You can use the test_benchmark to test the 1v1 RDMA performance.

You misread the ib_send_bw bandwidth unit. The performance is expected.

bobzhuyb commented 4 years ago

@wuyujiji You can check the counters in the folder /sys/class/infiniband/mlx5_1/ports/1/hw_counters/ (or change mlx5_1 to another IB device according to your setup). out_of_sequence, rp_cnp_handled, np_ecn_marked_roce_packets, etc. should give you a good understanding of the network-level details. Those counters should not grow when you run 1-to-1 traffic.
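
For example, something like this (adjust the device name to whatever ibdev2netdev maps to your active interface, e.g. mlx5_0 for eth0 above):

cd /sys/class/infiniband/mlx5_0/ports/1/hw_counters
# print each counter with its current value
grep . out_of_sequence rp_cnp_handled np_ecn_marked_roce_packets np_cnp_sent
# or watch them while the benchmark is running
watch -n 1 'grep . out_of_sequence rp_cnp_handled np_ecn_marked_roce_packets np_cnp_sent'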

wuyujiji commented 4 years ago

@bobzhuyb Hi, I am not familiar with RDMA. When I test 1-to-1 traffic (test_benchmark.cc), the program finishes quickly; the output is:

[10:50:51] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[10:50:51] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[10:50:51] src/./rdma_van.h:234: Connect to Node 8 with Transport=RDMA
[10:50:51] src/./rdma_van.h:806: OnConnect to Node 9 with Transport=RDMA
[10:50:51] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[10:50:51] src/./rdma_van.h:806: OnConnect to Node 8 with Transport=RDMA
[10:50:51] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[10:50:51] tests/test_benchmark.cc:177: 1 servers in total
[10:50:51] tests/test_benchmark.cc:111: ========= PUSH_PULL mode =========
[10:50:51] tests/test_benchmark.cc:112: ========= msg_size=1024000 bytes =========
[10:50:51] tests/test_benchmark.cc:164: Application goodput: 76.5938 Gbps. count = 10

When checking out_of_sequence, rp_cnp_handled, np_ecn_marked_roce_packets, and np_cnp_sent, none of the values increased; I don't know whether that is simply because the program runs too briefly for them to grow.

In addition, when I run test_ipc_benchmark.cc for about five minutes, among out_of_sequence, rp_cnp_handled, np_ecn_marked_roce_packets, and np_cnp_sent, only out_of_sequence and rp_cnp_handled increase.

wuyujiji commented 4 years ago

I did another experiment: when reducing to 1 worker and 1 server on one machine and running test_ipc_benchmark.cc, out_of_sequence and rp_cnp_handled never change.

bobzhuyb commented 4 years ago

A growing out_of_sequence counter means there are packet drops, so PFC is probably not enabled. You should ask your system admin about the PFC configuration. You may be able to check the configuration with mlnx_qos -i eth0 (or another interface).

wuyujiji commented 4 years ago

My system admin checked that the PFC config is enabled. The output of mlnx_qos -i eth0 is:

DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
        prio:0 dscp:07,06,05,04,03,02,01,00,
        prio:1 dscp:15,14,13,12,11,10,09,08,
        prio:2 dscp:23,22,21,20,19,18,17,16,
        prio:3 dscp:31,30,29,28,27,26,25,24,
        prio:4 dscp:39,38,37,36,35,34,33,32,
        prio:5 dscp:47,46,45,44,43,42,41,40,
        prio:6 dscp:55,54,53,52,51,50,49,48,
        prio:7 dscp:63,62,61,60,59,58,57,56,
Receive buffer size (bytes): 130944,130944,0,0,0,0,0,0,
Cable len: 7
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   1   0   0   0   0
        buffer      0   0   0   1   0   0   0   0
tc: 0 ratelimit: unlimited, tsa: vendor
         priority:  1
tc: 1 ratelimit: unlimited, tsa: vendor
         priority:  0
tc: 2 ratelimit: unlimited, tsa: vendor
         priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
         priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
         priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
         priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
         priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
         priority:  7

I am sorry, I don't know whether the PFC config is enabled. Could you please help me check this? Thanks a lot!

wuyujiji commented 4 years ago

@bobzhuyb @ymjiang Hello, do you have results for the random-data experiment with byteps/example/pytorch/benchmark_byteps.py? I want to check my corresponding results against them.