MichaelHsu170 opened this issue 3 years ago
What is the bandwidth between these two nodes?
Both nodes have 200Gbps NICs. I measured an average speed of around 450Mb/s with iftop (`sudo iftop -n`).
By the way, are there any recommended configurations for running data-parallel training with VGG16 on 2 nodes? For example, how many workers and how many servers should we start? Do we need a separate machine as the scheduler?
> 2. for scenario 2: 3.4 img/sec per GPU

and

> scenario 2:
> Model: vgg16
> Batch size: 32
> Number of GPUs: 16
> Running warmup...
> Running benchmark...
> 300 img/sec per GPU
So what exactly is the performance for scenario 2?
Here are a few tips on recommended configs: https://github.com/bytedance/byteps/blob/master/docs/best-practice.md
Also, your throughput is less than 0.5Gbps, which is not expected with 200Gbps NICs. Can you use this benchmark to test the networking performance? https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark
Hi @ymjiang, Happy Chinese New Year!!! Sorry, for scenario 2 it is 3.4 img/sec. I'll try the benchmark tool to measure the networking performance. Thank you.
Hi @ymjiang, we tried the basic benchmark mentioned in https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark, but got failures. Could you suggest how we can get it working? Thank you. We ran 2 scenarios:
ib_send_bw works correctly on both machines.

```
$ ibdev2netdev
mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)
mlx5_2 port 1 ==> ib2 (Up)
mlx5_3 port 1 ==> ib3 (Up)
mlx5_4 port 1 ==> ib4 (Up)
mlx5_5 port 1 ==> ib5 (Up)
mlx5_6 port 1 ==> ib6 (Up)
mlx5_7 port 1 ==> ib7 (Up)
mlx5_8 port 1 ==> enp225s0f0 (Down)
mlx5_9 port 1 ==> enp225s0f1 (Down)
```
```
$ ibv_devinfo
hca_id: mlx5_0
    transport: InfiniBand (0)  fw_ver: 20.28.1002  node_guid: 0c42:a103:000b:cd3e  sys_image_guid: 0c42:a103:000b:cd3e  vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000223  phys_port_cnt: 1
    port: 1  state: PORT_ACTIVE (4)  max_mtu: 4096 (5)  active_mtu: 4096 (5)  sm_lid: 1  port_lid: 7  port_lmc: 0x00  link_layer: InfiniBand
hca_id: mlx5_1
    transport: InfiniBand (0)  fw_ver: 20.28.1002  node_guid: 0c42:a103:000b:cbc6  sys_image_guid: 0c42:a103:000b:cbc6  vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000223  phys_port_cnt: 1
    port: 1  state: PORT_ACTIVE (4)  max_mtu: 4096 (5)  active_mtu: 4096 (5)  sm_lid: 1  port_lid: 17  port_lmc: 0x00  link_layer: InfiniBand
hca_id: mlx5_2
    transport: InfiniBand (0)  fw_ver: 20.28.1002  node_guid: 0c42:a103:000b:cc1e  sys_image_guid: 0c42:a103:000b:cc1e  vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000223  phys_port_cnt: 1
    port: 1  state: PORT_ACTIVE (4)  max_mtu: 4096 (5)  active_mtu: 4096 (5)  sm_lid: 1  port_lid: 5  port_lmc: 0x00  link_layer: InfiniBand
hca_id: mlx5_3
    transport: InfiniBand (0)  fw_ver: 20.28.1002  node_guid: 0c42:a103:000c:024c  sys_image_guid: 0c42:a103:000c:024c  vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000223  phys_port_cnt: 1
    port: 1  state: PORT_ACTIVE (4)  max_mtu: 4096 (5)  active_mtu: 4096 (5)  sm_lid: 1  port_lid: 8  port_lmc: 0x00  link_layer: InfiniBand
hca_id: mlx5_4
    transport: InfiniBand (0)  fw_ver: 20.28.1002  node_guid: 0c42:a103:000b:cbc2  sys_image_guid: 0c42:a103:000b:cbc2  vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000223  phys_port_cnt: 1
    port: 1  state: PORT_ACTIVE (4)  max_mtu: 4096 (5)  active_mtu: 4096 (5)  sm_lid: 1  port_lid: 16  port_lmc: 0x00  link_layer: InfiniBand
hca_id: mlx5_5
    transport: InfiniBand (0)  fw_ver: 20.28.1002  node_guid: 0c42:a103:000b:cd2a  sys_image_guid: 0c42:a103:000b:cd2a  vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000223  phys_port_cnt: 1
    port: 1  state: PORT_ACTIVE (4)  max_mtu: 4096 (5)  active_mtu: 4096 (5)  sm_lid: 1  port_lid: 6  port_lmc: 0x00  link_layer: InfiniBand
hca_id: mlx5_6
    transport: InfiniBand (0)  fw_ver: 20.28.1002  node_guid: 0c42:a103:000c:0478  sys_image_guid: 0c42:a103:000c:0478  vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000223  phys_port_cnt: 1
    port: 1  state: PORT_ACTIVE (4)  max_mtu: 4096 (5)  active_mtu: 4096 (5)  sm_lid: 1  port_lid: 11  port_lmc: 0x00  link_layer: InfiniBand
hca_id: mlx5_7
    transport: InfiniBand (0)  fw_ver: 20.28.1002  node_guid: 0c42:a103:000c:0488  sys_image_guid: 0c42:a103:000c:0488  vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000223  phys_port_cnt: 1
    port: 1  state: PORT_ACTIVE (4)  max_mtu: 4096 (5)  active_mtu: 4096 (5)  sm_lid: 1  port_lid: 12  port_lmc: 0x00  link_layer: InfiniBand
hca_id: mlx5_8
    transport: InfiniBand (0)  fw_ver: 20.28.1002  node_guid: 0c42:a103:000a:37da  sys_image_guid: 0c42:a103:000a:37da  vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000225  phys_port_cnt: 1
    port: 1  state: PORT_DOWN (1)  max_mtu: 4096 (5)  active_mtu: 1024 (3)  sm_lid: 0  port_lid: 0  port_lmc: 0x00  link_layer: Ethernet
hca_id: mlx5_9
    transport: InfiniBand (0)  fw_ver: 20.28.1002  node_guid: 0c42:a103:000a:37db  sys_image_guid: 0c42:a103:000a:37da  vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000225  phys_port_cnt: 1
    port: 1  state: PORT_DOWN (1)  max_mtu: 4096 (5)  active_mtu: 1024 (3)  sm_lid: 0  port_lid: 0  port_lmc: 0x00  link_layer: Ethernet
```
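For reference, the ib_send_bw check above was a plain point-to-point perftest run along these lines; the device name and address below are placeholders (same style as the rest of this issue), not necessarily the exact ones used:

```bash
# On node A, start the ib_send_bw server on one of the active HCAs:
ib_send_bw -d mlx5_0

# On node B, run the client against node A's IPoIB address:
ib_send_bw -d mlx5_0 xxx.xxx.xxx.xxx
```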
Can you show the output of `ulimit -l`?
It shows unlimited:

```
$ ulimit -l
unlimited
```
There was a similar issue before: https://github.com/bytedance/byteps/issues/282. Can you try this setup: 1 scheduler + 2 servers + 2 workers? It may give better load balancing than using one server.
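Concretely, with the run scripts quoted later in this issue (which already set DMLC_NUM_WORKER=2 and DMLC_NUM_SERVER=2, with DMLC_PS_ROOT_URI pointing at the scheduler's machine), that 1 scheduler + 2 servers + 2 workers layout is roughly the following sketch:

```bash
# machine A (hosts the scheduler; DMLC_PS_ROOT_URI = machine A's address)
./run_scheduler.sh &
./run_server.sh &
./run_worker.sh &

# machine B (same DMLC_PS_ROOT_URI/PORT values, still pointing at machine A)
./run_server.sh &
./run_worker.sh &
```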
I tried this scenario on 2 machines:
machine A: scheduler, server, worker
machine B: server, worker

But the processes on machine B still crashed with the error message:

```
what():  [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
```

Reducing BYTEPS_RDMA_START_DEPTH and BYTEPS_RDMA_RX_DEPTH yields the same error.
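For reference, the reduction was done by exporting the two variables named in the error message before launching; the values below are illustrative, not the exact ones tried:

```bash
# Defaults per the error message: BYTEPS_RDMA_START_DEPTH=128, BYTEPS_RDMA_RX_DEPTH=2048.
export BYTEPS_RDMA_START_DEPTH=32   # illustrative smaller value
export BYTEPS_RDMA_RX_DEPTH=512     # illustrative smaller value
./run_worker.sh
```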
The ticket you mentioned seems to be related to PFC. Do you think this error could be caused by PFC being disabled?
PFC is not related to this problem. However, I am not sure what the cause could be. Perhaps some hardware configuration on your machines is limited, but I have no idea at the moment.
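One way to inspect such limits (a diagnostic sketch only, not a confirmed cause) is to check the locked-memory limit in the exact shell that launches the processes, and the device's registration-related caps from the verbose ibv_devinfo output:

```bash
# Locked-memory limit for the launching shell/user (should be unlimited for RDMA):
ulimit -l

# Verbose device attributes; max_mr (number of memory regions) and max_mr_size
# cap what ibv_reg_mr can register on this device:
ibv_devinfo -v -d mlx5_0 | grep -iE 'max_mr|max_qp|max_cq'
```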
Does using 1 worker and 1 server work?
If the scheduler, 1 server and 1 worker run on the same machine, the scheduler crashed with:

```
terminate called after throwing an instance of 'dmlc::Error'
  what():  [09:45:09] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)
```

Running them on 2 machines:
machine A: scheduler, server
machine B: worker

the worker on machine B crashed with:

```
what():  [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
```
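For completeness, such a minimal 1-scheduler/1-server/1-worker run across 2 machines uses the same variables as the scripts in this issue, only with the counts set to 1. A sketch (addresses are placeholders as in the original scripts):

```bash
# machine A: scheduler + server
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx   # machine A's address
export DMLC_PS_ROOT_PORT=yyyy
DMLC_ROLE=scheduler python3 ./bin/bpslaunch &
DMLC_ROLE=server    python3 ./bin/bpslaunch &

# machine B: worker (same DMLC_NUM_* and DMLC_PS_ROOT_* values as above)
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
DMLC_ROLE=worker python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
```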
Hi @ymjiang, any recommendation would be greatly appreciated. Thank you.
Would you check these similar issues -- https://github.com/bytedance/byteps/issues/371 and https://github.com/bytedance/byteps/issues/372?
Describe the bug
I've tried the following 2 scenarios and compared their performance.
To Reproduce
Steps to reproduce the behavior:

scenario 1: run_worker.sh

```bash
#!/bin/bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export DMLC_ROLE=worker
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
```

Run ./run_worker.sh on 1 node.

scenario 2: run_scheduler.sh

```bash
#!/bin/bash
export DMLC_ROLE=scheduler
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch
```
run_server.sh

```bash
#!/bin/bash
export DMLC_ROLE=server
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch
```
run_worker.sh

```bash
#!/bin/bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export DMLC_ROLE=worker
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
```
Run ./run_scheduler.sh, ./run_server.sh and ./run_worker.sh on 1 node, and then run ./run_server.sh and ./run_worker.sh on another node.

scenario 1:

```
Model: vgg16
Batch size: 32
Number of GPUs: 8
Running warmup...
Running benchmark...
300 img/sec per GPU
```

scenario 2:

```
Model: vgg16
Batch size: 32
Number of GPUs: 16
Running warmup...
Running benchmark...
3.4 img/sec per GPU
```
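To put the gap in absolute terms, the aggregate throughput implied by the per-GPU numbers above works out as follows (simple arithmetic):

```bash
# scenario 1: 300 img/sec per GPU on 8 GPUs; scenario 2: 3.4 img/sec per GPU on 16 GPUs
echo "scenario 1: $(echo '300 * 8' | bc) img/sec aggregate"    # 2400
echo "scenario 2: $(echo '3.4 * 16' | bc) img/sec aggregate"   # 54.4
```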
Expected behavior
No such big performance gap.