bytedance / byteps

A high performance and generic framework for distributed DNN training

Performance regression with multi-node running #365

Open MichaelHsu170 opened 3 years ago

MichaelHsu170 commented 3 years ago

Describe the bug
I've tried the following 2 scenarios and compared their performance.

  1. Run VGG16 on a single node with 8 GPUs with bpslaunch.
  2. Run VGG16 on 2 nodes with 8 GPUs each with bpslaunch. Performance regressed a lot in scenario 2, to roughly 1/100 of scenario 1 (see the quick check below):
     • scenario 1: 300 img/sec per GPU
     • scenario 2: 3.4 img/sec per GPU
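For scale, those per-GPU figures imply the following aggregate throughput (a quick check using only the numbers reported above):

```
# scenario 1:  8 GPUs x 300 img/s = 2400 img/s aggregate
# scenario 2: 16 GPUs x 3.4 img/s = 54.4 img/s aggregate
echo "scale=1; (8*300)/(16*3.4)" | bc   # ~44x drop in aggregate throughput
echo "scale=1; 300/3.4" | bc            # ~88x drop per GPU (the "1/100" above)
```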

To Reproduce
Steps to reproduce the behavior:

  1. git clone https://github.com/bytedance/byteps.git
  2. python3 setup.py install
  3. dpkg -i nccl-local-repo-ubuntu1804-2.8.4-cuda11.0_1.0-1_amd64.deb
  4. apt install libnccl2 libnccl-dev
  5. Prepare the running script for scenario 1:

run_worker.sh

```
#!/bin/bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export DMLC_ROLE=worker
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
```
  6. Run ./run_worker.sh on 1 node
  7. Prepare the running scripts for scenario 2:

run_scheduler.sh

```
#!/bin/bash
export DMLC_ROLE=scheduler
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch
```

run_server.sh

```
#!/bin/bash
export DMLC_ROLE=server
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch
```

run_worker.sh

```
#!/bin/bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export DMLC_ROLE=worker
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
```

  8. Run ./run_scheduler.sh, ./run_server.sh and ./run_worker.sh on one node, and then run ./run_server.sh and ./run_worker.sh on the other node.
  9. Performance (see the rough check below):

scenario 1:

```
Model: vgg16
Batch size: 32
Number of GPUs: 8
Running warmup...
Running benchmark...
300 img/sec per GPU
```

scenario 2:

```
Model: vgg16
Batch size: 32
Number of GPUs: 16
Running warmup...
Running benchmark...
3.4 img/sec per GPU
```
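A rough estimate of the communication cost (a sketch based only on public VGG16 facts and the ~450 Mb/s measured later in this thread; it assumes fp32 gradients, no compression, and ignores overlap with compute):

```
# VGG16 has ~138M parameters -> ~552 MB of fp32 gradients per iteration.
# Each worker pushes and pulls them, i.e. ~8800 Mbit of traffic per iteration.
echo "scale=1; 138*4*8*2/450" | bc   # ~19.6 s/iter at 450 Mb/s
# Same order as the observed 32 img / 3.4 img/s ~ 9.4 s/iter in scenario 2,
# so a slow link alone can explain a regression of this size.
```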

Expected behavior
No such big performance gap.



ymjiang commented 3 years ago

How much is the bandwidth between these two nodes?

MichaelHsu170 commented 3 years ago

Both nodes have 200Gb NICs. I measured an average speed of around 450Mb/s with iftop (sudo iftop -n).
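(For what it's worth, iftop reports the traffic that is actually flowing, not what the link can carry. One way to measure achievable point-to-point bandwidth, assuming iperf3 is installed on both nodes, is sketched below; note that iperf3 runs over TCP/IP, so on InfiniBand it exercises IPoIB rather than the RDMA path.)

```
# On node A:
iperf3 -s
# On node B (replace <node_A_ip>; 8 parallel streams for 30 s):
iperf3 -c <node_A_ip> -P 8 -t 30
```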

MichaelHsu170 commented 3 years ago

By the way, are there any recommended configurations for running data-parallel training with VGG16 on 2 nodes? For example, how many workers and how many servers should we start? Do we need a separate machine as the scheduler?

ymjiang commented 3 years ago
  1. I am confused by your log. You mentioned

> for scenario 2: 3.4 img/sec per GPU

and

> scenario 2: Model: vgg16 Batch size: 32 Number of GPUs: 16 Running warmup... Running benchmark... 300 img/sec per GPU

So what exactly is the performance for scenario 2?

  2. Here are a few tips on recommended configs: https://github.com/bytedance/byteps/blob/master/docs/best-practice.md

  3. Your throughput is less than 0.5Gbps, which is not expected with 200Gbps NICs. Can you use this benchmark to test the networking performance? https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark
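Per the linked README, the basic benchmark is run roughly like this (a sketch with placeholder addresses; the binary name, make flag, and DMLC_* variables are as documented on that page):

```
# Build ps-lite (byteps branch) with RDMA support:
git clone -b byteps https://github.com/bytedance/ps-lite
cd ps-lite && make -j USE_RDMA=1

# Shared rendezvous settings for all roles:
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx   # scheduler's IP
export DMLC_PS_ROOT_PORT=yyyy
export DMLC_ENABLE_RDMA=1

# One process per role, e.g. scheduler+server on one node, worker on the other:
DMLC_ROLE=scheduler ./tests/test_benchmark
DMLC_ROLE=server    ./tests/test_benchmark
DMLC_ROLE=worker    ./tests/test_benchmark
```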

MichaelHsu170 commented 3 years ago

Hi @ymjiang, Happy Chinese New Year!!! Sorry, for scenario 2 it is 3.4 img/sec. I'll try the benchmark tool for networking performance measurement. Thank you.

MichaelHsu170 commented 3 years ago

Hi @ymjiang, We tried the basic benchmark mentioned in https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark, but got failures. Could you suggest how we can get it working? Thank you. We ran 2 scenarios:

  1. 1 scheduler, 1 server and 2 workers on 2 machines. On machine A, 1 scheduler and 1 worker were run. On machine B, 1 server and 1 worker were run. This scenario failed with "what(): [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)". We tried setting these 2 environment variables to even single-digit numbers, but this error always showed up.
  2. 1 scheduler, 1 server and 2 workers were run on a single machine. The scheduler crashed with the error "what(): [08:43:06] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)" the moment the last client (server or worker) was launched.
    • ib_send_bw works correctly on both machines.
    • We used the IP address of the ib0 port as the scheduler address.

```
$ ibdev2netdev
mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)
mlx5_2 port 1 ==> ib2 (Up)
mlx5_3 port 1 ==> ib3 (Up)
mlx5_4 port 1 ==> ib4 (Up)
mlx5_5 port 1 ==> ib5 (Up)
mlx5_6 port 1 ==> ib6 (Up)
mlx5_7 port 1 ==> ib7 (Up)
mlx5_8 port 1 ==> enp225s0f0 (Down)
mlx5_9 port 1 ==> enp225s0f1 (Down)

$ ibv_devinfo
hca_id: mlx5_0
    transport:      InfiniBand (0)
    fw_ver:         20.28.1002
    node_guid:      0c42:a103:000b:cd3e
    sys_image_guid: 0c42:a103:000b:cd3e
    vendor_id:      0x02c9
    vendor_part_id: 4123
    hw_ver:         0x0
    board_id:       MT_0000000223
    phys_port_cnt:  1
    port: 1
        state:      PORT_ACTIVE (4)
        max_mtu:    4096 (5)
        active_mtu: 4096 (5)
        sm_lid:     1
        port_lid:   7
        port_lmc:   0x00
        link_layer: InfiniBand
hca_id: mlx5_1
    transport:      InfiniBand (0)
    fw_ver:         20.28.1002
    node_guid:      0c42:a103:000b:cbc6
    sys_image_guid: 0c42:a103:000b:cbc6
    vendor_id:      0x02c9
    vendor_part_id: 4123
    hw_ver:         0x0
    board_id:       MT_0000000223
    phys_port_cnt:  1
    port: 1
        state:      PORT_ACTIVE (4)
        max_mtu:    4096 (5)
        active_mtu: 4096 (5)
        sm_lid:     1
        port_lid:   17
        port_lmc:   0x00
        link_layer: InfiniBand
hca_id: mlx5_2
    transport:      InfiniBand (0)
    fw_ver:         20.28.1002
    node_guid:      0c42:a103:000b:cc1e
    sys_image_guid: 0c42:a103:000b:cc1e
    vendor_id:      0x02c9
    vendor_part_id: 4123
    hw_ver:         0x0
    board_id:       MT_0000000223
    phys_port_cnt:  1
    port: 1
        state:      PORT_ACTIVE (4)
        max_mtu:    4096 (5)
        active_mtu: 4096 (5)
        sm_lid:     1
        port_lid:   5
        port_lmc:   0x00
        link_layer: InfiniBand
hca_id: mlx5_3
    transport:      InfiniBand (0)
    fw_ver:         20.28.1002
    node_guid:      0c42:a103:000c:024c
    sys_image_guid: 0c42:a103:000c:024c
    vendor_id:      0x02c9
    vendor_part_id: 4123
    hw_ver:         0x0
    board_id:       MT_0000000223
    phys_port_cnt:  1
    port: 1
        state:      PORT_ACTIVE (4)
        max_mtu:    4096 (5)
        active_mtu: 4096 (5)
        sm_lid:     1
        port_lid:   8
        port_lmc:   0x00
        link_layer: InfiniBand
hca_id: mlx5_4
    transport:      InfiniBand (0)
    fw_ver:         20.28.1002
    node_guid:      0c42:a103:000b:cbc2
    sys_image_guid: 0c42:a103:000b:cbc2
    vendor_id:      0x02c9
    vendor_part_id: 4123
    hw_ver:         0x0
    board_id:       MT_0000000223
    phys_port_cnt:  1
    port: 1
        state:      PORT_ACTIVE (4)
        max_mtu:    4096 (5)
        active_mtu: 4096 (5)
        sm_lid:     1
        port_lid:   16
        port_lmc:   0x00
        link_layer: InfiniBand
hca_id: mlx5_5
    transport:      InfiniBand (0)
    fw_ver:         20.28.1002
    node_guid:      0c42:a103:000b:cd2a
    sys_image_guid: 0c42:a103:000b:cd2a
    vendor_id:      0x02c9
    vendor_part_id: 4123
    hw_ver:         0x0
    board_id:       MT_0000000223
    phys_port_cnt:  1
    port: 1
        state:      PORT_ACTIVE (4)
        max_mtu:    4096 (5)
        active_mtu: 4096 (5)
        sm_lid:     1
        port_lid:   6
        port_lmc:   0x00
        link_layer: InfiniBand
hca_id: mlx5_6
    transport:      InfiniBand (0)
    fw_ver:         20.28.1002
    node_guid:      0c42:a103:000c:0478
    sys_image_guid: 0c42:a103:000c:0478
    vendor_id:      0x02c9
    vendor_part_id: 4123
    hw_ver:         0x0
    board_id:       MT_0000000223
    phys_port_cnt:  1
    port: 1
        state:      PORT_ACTIVE (4)
        max_mtu:    4096 (5)
        active_mtu: 4096 (5)
        sm_lid:     1
        port_lid:   11
        port_lmc:   0x00
        link_layer: InfiniBand
hca_id: mlx5_7
    transport:      InfiniBand (0)
    fw_ver:         20.28.1002
    node_guid:      0c42:a103:000c:0488
    sys_image_guid: 0c42:a103:000c:0488
    vendor_id:      0x02c9
    vendor_part_id: 4123
    hw_ver:         0x0
    board_id:       MT_0000000223
    phys_port_cnt:  1
    port: 1
        state:      PORT_ACTIVE (4)
        max_mtu:    4096 (5)
        active_mtu: 4096 (5)
        sm_lid:     1
        port_lid:   12
        port_lmc:   0x00
        link_layer: InfiniBand
hca_id: mlx5_8
    transport:      InfiniBand (0)
    fw_ver:         20.28.1002
    node_guid:      0c42:a103:000a:37da
    sys_image_guid: 0c42:a103:000a:37da
    vendor_id:      0x02c9
    vendor_part_id: 4123
    hw_ver:         0x0
    board_id:       MT_0000000225
    phys_port_cnt:  1
    port: 1
        state:      PORT_DOWN (1)
        max_mtu:    4096 (5)
        active_mtu: 1024 (3)
        sm_lid:     0
        port_lid:   0
        port_lmc:   0x00
        link_layer: Ethernet
hca_id: mlx5_9
    transport:      InfiniBand (0)
    fw_ver:         20.28.1002
    node_guid:      0c42:a103:000a:37db
    sys_image_guid: 0c42:a103:000a:37da
    vendor_id:      0x02c9
    vendor_part_id: 4123
    hw_ver:         0x0
    board_id:       MT_0000000225
    phys_port_cnt:  1
    port: 1
        state:      PORT_DOWN (1)
        max_mtu:    4096 (5)
        active_mtu: 1024 (3)
        sm_lid:     0
        port_lid:   0
        port_lmc:   0x00
        link_layer: Ethernet
```
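(For context: ibv_reg_mr registers, i.e. pins, a buffer with the NIC, and "Cannot allocate memory" from it usually means the calling process hit its locked-memory limit rather than that the host is out of RAM, which is why the next question asks about ulimit -l. Two quick checks follow; the Docker one only applies if the processes run in containers, which this thread does not state:)

```
# Locked-memory limit of the current shell; child processes inherit it:
ulimit -l
# If running inside Docker, the container needs its own unlimited memlock
# and access to the IB devices, e.g.:
docker run --ulimit memlock=-1 --net=host --device=/dev/infiniband ...
```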
ymjiang commented 3 years ago

Can you show the output of `ulimit -l`?

MichaelHsu170 commented 3 years ago

It showed unlimited:

```
$ ulimit -l
unlimited
```

ymjiang commented 3 years ago

There was a similar issue before: https://github.com/bytedance/byteps/issues/282. Can you try this setup: 1 scheduler + 2 servers + 2 workers? It may have better load balancing than using one server.
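For concreteness, a minimal sketch of that 1-scheduler + 2-server + 2-worker layout, mirroring the run_*.sh scripts earlier in this issue (addresses are placeholders):

```
# Shared settings for all five processes (machine A's IP as the root):
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7   # workers only

# Machine A:
DMLC_ROLE=scheduler python3 ./bin/bpslaunch &
DMLC_ROLE=server    python3 ./bin/bpslaunch &
DMLC_ROLE=worker    python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20

# Machine B:
DMLC_ROLE=server    python3 ./bin/bpslaunch &
DMLC_ROLE=worker    python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
```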

MichaelHsu170 commented 3 years ago

I tried this scenario on 2 machines:
    • machine A: scheduler, server, worker
    • machine B: server, worker

But processes on machine B still crashed with the error message what(): [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048). Reducing BYTEPS_RDMA_START_DEPTH and BYTEPS_RDMA_RX_DEPTH yields the same error. The ticket you mentioned seems to be related to PFC. Do you think this error could be caused by disabled PFC functionality?

ymjiang commented 3 years ago

PFC is not related to this problem. However, I am not sure about the possible reasons. Perhaps some hardware configurations on your machines are limited, but I have no idea right now.

Does using 1 worker and 1 server work?

MichaelHsu170 commented 3 years ago

If the scheduler, 1 server and 1 worker run on the same machine, the scheduler crashed with:

```
terminate called after throwing an instance of 'dmlc::Error'
  what(): [09:45:09] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)
```

When they run on 2 machines (machine A: scheduler, server; machine B: worker), the worker on machine B crashed with:

```
what(): [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
```
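(For context: RDMA_CM_EVENT_ADDR_ERROR is what the RDMA connection manager reports when rdma_resolve_addr cannot map the given IP address to an RDMA-capable device. One plausible check for the single-machine case, reusing the tools shown above, is whether DMLC_PS_ROOT_URI is an address on one of the "Up" IB ports rather than, say, 127.0.0.1:)

```
# Which netdev backs each RDMA device, and which are Up:
ibdev2netdev
# Does the scheduler address (DMLC_PS_ROOT_URI) live on one of those ports?
ip -4 addr show ib0
```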

MichaelHsu170 commented 3 years ago

Hi @ymjiang, any recommendation would be greatly appreciated. Thank you.

ymjiang commented 3 years ago

Would you check these similar issues -- https://github.com/bytedance/byteps/issues/371 and https://github.com/bytedance/byteps/issues/372?