bytedance / byteps

A high performance and generic framework for distributed DNN training

Error when running on multiple machines #350

Closed Rivendile closed 3 years ago

Rivendile commented 3 years ago

Hi, when I run BytePS MXNet on 4 workers and 4 servers using the code https://github.com/byteps/examples/blob/master/mxnet/vgg-16/train_imagenet.py, everything is fine. However, when I try to run on 8 workers and 8 servers, the process gets stuck after printing:

[06:26:10] src/postoffice.cc:25: Creating Van: zmq
[06:26:10] src/./zmq_van.h:299: Start ZMQ recv thread

Any suggestions?

Rivendile commented 3 years ago

Besides, the process does not exit after printing:

[07:31:38] byteps/server/server.cc:512: byteps has been shutdown

Rivendile commented 3 years ago

The first problem has been solved. It was caused by a wrong setting for the 8-worker configuration.
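
For future reference, here is a minimal sketch of the consistency check involved (the exact misconfigured variable is not shown above, so treat the values below as assumptions): every scheduler, server, and worker process must see the same DMLC_NUM_WORKER and DMLC_NUM_SERVER, and each worker machine needs a distinct DMLC_WORKER_ID in the range 0..DMLC_NUM_WORKER-1.

# identical on every node in the assumed 8-worker / 8-server run
export DMLC_NUM_WORKER=8
export DMLC_NUM_SERVER=8
export DMLC_PS_ROOT_URI=<scheduler_ip>   # placeholder address
export DMLC_PS_ROOT_PORT=8000
# additionally, on the k-th worker machine (k = 0..7):
export DMLC_ROLE=worker
export DMLC_WORKER_ID=k                  # placeholder index, unique per worker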

ymjiang commented 3 years ago

> Besides, the process does not exit after printing [07:31:38] byteps/server/server.cc:512: byteps has been shutdown.

Which commit or PyPI version did you use?

Rivendile commented 3 years ago

I use the master branch and the provided Dockerfile.

ymjiang commented 3 years ago

I have confirmed that b8948f092796fa177c56a2a1595188dc7cce9999 does not have the problem.

Can you set PS_VERBOSE=2 for the scheduler process and paste the complete output here?
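
For example, a minimal sketch, assuming a bpslaunch-based setup (adapt the variables to your own environment), is to prepend the flag to the scheduler's environment:

PS_VERBOSE=2 DMLC_ROLE=scheduler DMLC_NUM_SERVER=$dmlc_num_server DMLC_NUM_WORKER=$dmlc_num_worker DMLC_PS_ROOT_URI=$dmlc_ps_root_uri DMLC_PS_ROOT_PORT=8000 bpslaunch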

Rivendile commented 3 years ago

The output of the master branch is as follows:

BytePS launching scheduler
Command: python3 -c 'import byteps.server'

[00:56:31] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[00:56:31] byteps/server/server.cc:438: Enable engine scheduling for BytePS server
[00:56:31] src/postoffice.cc:25: Creating Van: zmq
[00:56:31] src/./zmq_van.h:66: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[00:56:31] src/./zmq_van.h:71: BYTEPS_ZMQ_NTHREADS set to 4
[00:56:31] src/van.cc:441: Bind to [role=scheduler, id=1, ip=172.31.29.146, port=8000, is_recovery=0, aux_id=-1]
[00:56:31] src/./zmq_van.h:299: Start ZMQ recv thread
[00:56:31] src/van.cc:555: zeromq received: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, ip=172.31.29.146, port=38793, is_recovery=0, aux_id=-1] } }. THIS IS NOT DATA MSG!
[00:56:31] src/van.cc:555: zeromq received: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, ip=172.31.24.253, port=57317, is_recovery=0, aux_id=-1] } }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:555: zeromq received: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=worker, ip=172.31.29.146, port=51597, is_recovery=0, aux_id=-1] } }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:555: zeromq received: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=worker, ip=172.31.24.253, port=38537, is_recovery=0, aux_id=-1] } }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:189: assign rank=8 to node [role=server, ip=172.31.29.146, port=38793, is_recovery=0, aux_id=-1]
[00:56:34] src/van.cc:189: assign rank=9 to node [role=worker, ip=172.31.29.146, port=51597, is_recovery=0, aux_id=-1]
[00:56:34] src/van.cc:189: assign rank=11 to node [role=worker, ip=172.31.24.253, port=38537, is_recovery=0, aux_id=-1]
[00:56:34] src/van.cc:189: assign rank=10 to node [role=server, ip=172.31.24.253, port=57317, is_recovery=0, aux_id=-1]
[00:56:34] src/van.cc:530: zeromq sent: ? => 9. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=172.31.29.146, port=38793, is_recovery=0, aux_id=-1] [role=worker, id=9, ip=172.31.29.146, port=51597, is_recovery=0, aux_id=-1] [role=worker, id=11, ip=172.31.24.253, port=38537, is_recovery=0, aux_id=-1] [role=server, id=10, ip=172.31.24.253, port=57317, is_recovery=0, aux_id=-1] [role=scheduler, id=1, ip=172.31.29.146, port=8000, is_recovery=0, aux_id=-1] } }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:530: zeromq sent: ? => 11. Meta: request=0, timestamp=1, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=172.31.29.146, port=38793, is_recovery=0, aux_id=-1] [role=worker, id=9, ip=172.31.29.146, port=51597, is_recovery=0, aux_id=-1] [role=worker, id=11, ip=172.31.24.253, port=38537, is_recovery=0, aux_id=-1] [role=server, id=10, ip=172.31.24.253, port=57317, is_recovery=0, aux_id=-1] [role=scheduler, id=1, ip=172.31.29.146, port=8000, is_recovery=0, aux_id=-1] } }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:530: zeromq sent: ? => 8. Meta: request=0, timestamp=2, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=172.31.29.146, port=38793, is_recovery=0, aux_id=-1] [role=worker, id=9, ip=172.31.29.146, port=51597, is_recovery=0, aux_id=-1] [role=worker, id=11, ip=172.31.24.253, port=38537, is_recovery=0, aux_id=-1] [role=server, id=10, ip=172.31.24.253, port=57317, is_recovery=0, aux_id=-1] [role=scheduler, id=1, ip=172.31.29.146, port=8000, is_recovery=0, aux_id=-1] } }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:530: zeromq sent: ? => 10. Meta: request=0, timestamp=3, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=172.31.29.146, port=38793, is_recovery=0, aux_id=-1] [role=worker, id=9, ip=172.31.29.146, port=51597, is_recovery=0, aux_id=-1] [role=worker, id=11, ip=172.31.24.253, port=38537, is_recovery=0, aux_id=-1] [role=server, id=10, ip=172.31.24.253, port=57317, is_recovery=0, aux_id=-1] [role=scheduler, id=1, ip=172.31.29.146, port=8000, is_recovery=0, aux_id=-1] } }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:215: the scheduler is connected to 2 workers and 2 servers
[00:56:34] src/van.cc:555: zeromq received: 8 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:314: Barrier count for 7 : 1
[00:56:34] src/van.cc:555: zeromq received: 10 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:314: Barrier count for 7 : 2
[00:56:34] src/van.cc:555: zeromq received: 9 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:314: Barrier count for 7 : 3
[00:56:34] src/van.cc:530: zeromq sent: ? => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:555: zeromq received: 1 => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:314: Barrier count for 7 : 4
[00:56:34] src/van.cc:555: zeromq received: 11 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:314: Barrier count for 7 : 5
[00:56:34] src/van.cc:530: zeromq sent: ? => 9. Meta: request=0, timestamp=5, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:530: zeromq sent: ? => 11. Meta: request=0, timestamp=6, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:530: zeromq sent: ? => 8. Meta: request=0, timestamp=7, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:530: zeromq sent: ? => 10. Meta: request=0, timestamp=8, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:530: zeromq sent: ? => 1. Meta: request=0, timestamp=9, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:555: zeromq received: 1 => 1. Meta: request=0, timestamp=9, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:530: zeromq sent: ? => 1. Meta: request=1, timestamp=10, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:555: zeromq received: 8 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:314: Barrier count for 7 : 1
[00:56:34] src/van.cc:555: zeromq received: 1 => 1. Meta: request=1, timestamp=10, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:314: Barrier count for 7 : 2
[00:56:34] src/van.cc:555: zeromq received: 10 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:56:34] src/van.cc:314: Barrier count for 7 : 3
[00:57:05] src/van.cc:555: zeromq received: 9 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:57:05] src/van.cc:314: Barrier count for 7 : 4
[00:57:05] src/van.cc:555: zeromq received: 11 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:57:05] src/van.cc:314: Barrier count for 7 : 5
[00:57:05] src/van.cc:530: zeromq sent: ? => 9. Meta: request=0, timestamp=11, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:57:05] src/van.cc:530: zeromq sent: ? => 11. Meta: request=0, timestamp=12, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:57:05] src/van.cc:530: zeromq sent: ? => 8. Meta: request=0, timestamp=13, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:57:05] src/van.cc:530: zeromq sent: ? => 10. Meta: request=0, timestamp=14, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:57:05] src/van.cc:530: zeromq sent: ? => 1. Meta: request=0, timestamp=15, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:57:05] src/van.cc:555: zeromq received: 1 => 1. Meta: request=0, timestamp=15, control={ cmd=BARRIER, barrier_group=-1936234560 }. THIS IS NOT DATA MSG!
[00:57:05] src/./zmq_van.h:76: Stopping H[1]
[00:57:05] src/van.cc:555: zeromq received: 1 => 1. Meta: request=0, control={ cmd=TERMINATE }. THIS IS NOT DATA MSG!
[00:57:05] src/van.cc:104: H[1] is stopped
[00:57:05] src/./zmq_van.h:81: H all threads joined and destroyed
[00:57:05] byteps/server/server.cc:512: byteps has been shutdown

Rivendile commented 3 years ago

Besides, when I run on a single machine with 8 GPUs, the throughput is 1505.6 img/sec, which is approximately the same as yours in performance.md. However, when I run on multiple machines, such as 16 GPUs on 2 machines, I only get 1079.84 img/sec in total. I use the following commands to run the code https://github.com/byteps/examples/blob/master/mxnet/vgg-16/train_imagenet.py:

export dmlc_num_server=2
export dmlc_num_worker=2
export nvidia_visible_devices=0,1,2,3,4,5,6,7
export distributed_framework=byteps

export byteps_server_enable_schedule=0
export byteps_partition_bytes=4096000

export dmlc_ps_root_uri=ip1
export COMMAND='python3 /home/cluster/byteps/examples/mxnet/vgg-16/train_imagenet.py --network vgg --num-layers 16 --batch-size 32 --benchmark 1 --num-examples 3232 --num-epochs 1 --disp-batches 10'

scheduler

DMLC_ROLE=scheduler DMLC_NUM_SERVER=$dmlc_num_server DMLC_NUM_WORKER=$dmlc_num_worker NVIDIA_VISIBLE_DEVICES=$nvidia_visible_devices DISTRIBUTED_FRAMEWORK=$distributed_framework BYTEPS_SERVER_ENABLE_SCHEDULE=$byteps_server_enable_schedule BYTEPS_SCHEDULING_CREDIT=$byteps_scheduling_credit BYTEPS_PARTITION_BYTES=$byteps_partition_bytes DMLC_PS_ROOT_URI=$dmlc_ps_root_uri DMLC_PS_ROOT_PORT=8000 bpslaunch &

server

ssh cluster@ip1 "BYTEPS_TRACE_ON=$byteps_trace_on BYTEPS_TRACE_END_STEP=20 BYTEPS_TRACE_START_STEP=10 BYTEPS_TRACE_DIR=$byteps_trace_dir DMLC_ROLE=server DMLC_NUM_SERVER=$dmlc_num_server DMLC_NUM_WORKER=$dmlc_num_worker NVIDIA_VISIBLE_DEVICES=$nvidia_visible_devices DISTRIBUTED_FRAMEWORK=$distributed_framework BYTEPS_SERVER_ENABLE_SCHEDULE=$byteps_server_enable_schedule BYTEPS_SCHEDULING_CREDIT=$byteps_scheduling_credit BYTEPS_PARTITION_BYTES=$byteps_partition_bytes DMLC_PS_ROOT_URI=$dmlc_ps_root_uri DMLC_PS_ROOT_PORT=8000 bpslaunch &" &
ssh cluster@ip2 "BYTEPS_TRACE_ON=$byteps_trace_on BYTEPS_TRACE_END_STEP=20 BYTEPS_TRACE_START_STEP=10 BYTEPS_TRACE_DIR=$byteps_trace_dir DMLC_ROLE=server DMLC_NUM_SERVER=$dmlc_num_server DMLC_NUM_WORKER=$dmlc_num_worker NVIDIA_VISIBLE_DEVICES=$nvidia_visible_devices DISTRIBUTED_FRAMEWORK=$distributed_framework BYTEPS_SERVER_ENABLE_SCHEDULE=$byteps_server_enable_schedule BYTEPS_SCHEDULING_CREDIT=$byteps_scheduling_credit BYTEPS_PARTITION_BYTES=$byteps_partition_bytes DMLC_PS_ROOT_URI=$dmlc_ps_root_uri DMLC_PS_ROOT_PORT=8000 bpslaunch &" &

worker

ssh cluster@ip3 "DMLC_WORKER_ID=0 DMLC_ROLE=worker DMLC_NUM_SERVER=$dmlc_num_server DMLC_NUM_WORKER=$dmlc_num_worker NVIDIA_VISIBLE_DEVICES=$nvidia_visible_devices DISTRIBUTED_FRAMEWORK=$distributed_framework BYTEPS_SERVER_ENABLE_SCHEDULE=$byteps_server_enable_schedule BYTEPS_SCHEDULING_CREDIT=$byteps_scheduling_credit BYTEPS_PARTITION_BYTES=$byteps_partition_bytes DMLC_PS_ROOT_URI=$dmlc_ps_root_uri DMLC_PS_ROOT_PORT=8000 bpslaunch $COMMAND &" &
ssh cluster@ip4 "DMLC_WORKER_ID=1 DMLC_ROLE=worker DMLC_NUM_SERVER=$dmlc_num_server DMLC_NUM_WORKER=$dmlc_num_worker NVIDIA_VISIBLE_DEVICES=$nvidia_visible_devices DISTRIBUTED_FRAMEWORK=$distributed_framework BYTEPS_SERVER_ENABLE_SCHEDULE=$byteps_server_enable_schedule BYTEPS_SCHEDULING_CREDIT=$byteps_scheduling_credit BYTEPS_PARTITION_BYTES=$byteps_partition_bytes DMLC_PS_ROOT_URI=$dmlc_ps_root_uri DMLC_PS_ROOT_PORT=8000 bpslaunch $COMMAND &" &

Is there something I set wrong? I run on 4 AWS p3dn.24xlarge instances, and these machines are placed in one cluster placement group.

ymjiang commented 3 years ago

The scheduler's log looks good to me. Perhaps you can test b8948f0 to verify.
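
If it helps, a rough sketch of testing that commit from source (assuming the repo's setup.py-based install; adapt this to your Dockerfile workflow):

git clone --recursive https://github.com/bytedance/byteps.git
cd byteps
git checkout b8948f092796fa177c56a2a1595188dc7cce9999
git submodule update --init --recursive   # in case the checkout moved submodule pointers
python3 setup.py install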

Regarding the performance, you are using a small batch size (32) and running over TCP. Both factors lead to high overhead. If you want to compare with the numbers in performance.md, you need to use at least batch size 64.
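
For example, a sketch of the adjusted benchmark command, reusing the $COMMAND from your script with only the --batch-size value changed:

export COMMAND='python3 /home/cluster/byteps/examples/mxnet/vgg-16/train_imagenet.py --network vgg --num-layers 16 --batch-size 64 --benchmark 1 --num-examples 3232 --num-epochs 1 --disp-batches 10'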

Here are a few tips for tuning the performance -- https://github.com/bytedance/byteps/blob/master/docs/best-practice.md#multi-machine-distributed-mode. As far as I can see, you may use more server processes to try to saturate the TCP bandwidth.
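
For instance, a rough sketch of running two server processes per machine instead of one (the total server count in DMLC_NUM_SERVER must be updated everywhere, including the scheduler and workers; the remaining variables are the same as in your server command above):

export dmlc_num_server=4   # 2 machines x 2 server processes; must match on all roles
# launch two server processes on ip1 (and likewise on ip2):
ssh cluster@ip1 "DMLC_ROLE=server DMLC_NUM_SERVER=$dmlc_num_server DMLC_NUM_WORKER=$dmlc_num_worker DMLC_PS_ROOT_URI=$dmlc_ps_root_uri DMLC_PS_ROOT_PORT=8000 bpslaunch &" &
ssh cluster@ip1 "DMLC_ROLE=server DMLC_NUM_SERVER=$dmlc_num_server DMLC_NUM_WORKER=$dmlc_num_worker DMLC_PS_ROOT_URI=$dmlc_ps_root_uri DMLC_PS_ROOT_PORT=8000 bpslaunch &" &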

Rivendile commented 3 years ago

I will try as you said. Thanks a lot!