bytedance / byteps

A high performance and generic framework for distributed DNN training

How to deploy BYTEPS with 2 machines? #353

Open lizi998 opened 3 years ago

lizi998 commented 3 years ago

Describe the bug

I have successfully installed BytePS from source (byteps==0.2.5) inside my Docker container. I currently have only 2 machines, each with 8 V100 GPUs. How do I deploy BytePS across these 2 machines? Thanks, and I look forward to your reply.

Following the Distributed Training (TCP) section of A Step-by-Step Tutorial, is it OK to set things up like this?

  1. Create 2 Docker containers on machine 1: docker-scheduler and docker-worker.
  2. Create 2 Docker containers on machine 2: docker-server and docker-worker.

1) For the docker-scheduler on machine 1:

    # now you are in the docker-scheduler environment
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=scheduler
    export DMLC_NUM_SERVER=1
    export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
    export DMLC_PS_ROOT_PORT=1234 # the scheduler port
    bpslaunch

2) For the docker-worker on machine 1:

    # now you are in the docker-worker environment
    export NVIDIA_VISIBLE_DEVICES=0,1,2,3
    export DMLC_WORKER_ID=0
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=worker
    export DMLC_NUM_SERVER=1
    export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
    export DMLC_PS_ROOT_PORT=1234 # the scheduler port
    bpslaunch python /home/byteps_torch/byteps/example/pytorch/benchmark_byteps.py --model ResNet50 --num-iters 1000000

3) For the docker-server on machine 2:

    # now you are in the docker-server environment
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=server
    export DMLC_NUM_SERVER=1
    export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
    export DMLC_PS_ROOT_PORT=1234 # the scheduler port
    bpslaunch

4) For the docker-worker on machine 2:

    # now you are in the docker-worker environment
    export NVIDIA_VISIBLE_DEVICES=0,1,2,3
    export DMLC_WORKER_ID=1
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=worker
    export DMLC_NUM_SERVER=1
    export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
    export DMLC_PS_ROOT_PORT=1234 # the scheduler port
    bpslaunch python /home/byteps_torch/byteps/example/pytorch/benchmark_byteps.py --model ResNet50 --num-iters 1000000
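The four configurations above share the scheduler address and the worker and server counts; they differ only in role, worker ID and GPU visibility. As a rough sketch, a small helper could generate them from one place so the shared values cannot drift apart between containers. `byteps_env` is a hypothetical name and not part of BytePS; the variable names are the ones used in the commands above.

```python
def byteps_env(role, scheduler_ip="10.0.0.1", scheduler_port=1234,
               num_workers=2, num_servers=1, worker_id=None, gpus=None):
    """Return the environment variables from the post for one process.

    Hypothetical helper: defaults mirror the values used in this thread.
    """
    if role not in ("scheduler", "server", "worker"):
        raise ValueError(f"unknown role: {role!r}")

    # Values that must be identical in every container.
    env = {
        "DMLC_ROLE": role,
        "DMLC_NUM_WORKER": str(num_workers),
        "DMLC_NUM_SERVER": str(num_servers),
        "DMLC_PS_ROOT_URI": scheduler_ip,
        "DMLC_PS_ROOT_PORT": str(scheduler_port),
    }

    # Worker-only values: a unique ID and, optionally, the visible GPUs.
    if role == "worker":
        if worker_id is None or not 0 <= worker_id < num_workers:
            raise ValueError("worker_id must be in [0, num_workers)")
        env["DMLC_WORKER_ID"] = str(worker_id)
        if gpus is not None:
            env["NVIDIA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return env


if __name__ == "__main__":
    # Print the export lines for the worker on machine 2.
    for key, value in byteps_env("worker", worker_id=1, gpus=range(4)).items():
        print(f"export {key}={value}")
```

The printed lines can be pasted into the matching container before running `bpslaunch`.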


ymjiang commented 3 years ago

Yes, it looks good to me. Did you meet any problems?

lizi998 commented 3 years ago

> Yes, it looks good to me. Did you meet any problems?

With this setup, the server, worker 0 and worker 1 all start their threads and then wait. The output looks like this:

server:

    BytePS launching server
    Command: python -c 'import byteps.server'
    [06:47:33] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
    [06:47:33] src/postoffice.cc:25: Creating Van: zmq
    [06:47:33] src/./zmq_van.h:299: Start ZMQ recv thread

worker:

    BytePS launching worker
    [07:38:39] src/postoffice.cc:25: Creating Van: zmq
    [07:38:39] src/./zmq_van.h:299: Start ZMQ recv thread

But the scheduler fails:

    BytePS launching scheduler
    Command: python -c 'import byteps.server'
    [07:48:18] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
    [07:48:18] src/postoffice.cc:25: Creating Van: zmq
    [07:48:18] 3rdparty/ps-lite/include/dmlc/logging.h:276: [07:48:18] src/./zmq_van.h:121: Reached max retry for bind: Address already in use. errno = 98

    Stack trace returned 10 entries:
    [bt] (0) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2300b) [0x7f96e619000b]
    [bt] (1) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x238b1) [0x7f96e61908b1]
    [bt] (2) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x649f4) [0x7f96e61d19f4]
    [bt] (3) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x4f26d) [0x7f96e61bc26d]
    [bt] (4) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x66779) [0x7f96e61d3779]
    [bt] (5) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x4a1be) [0x7f96e61b71be]
    [bt] (6) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps_server+0xd36) [0x7f96e618e5d6]
    [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f96e66b7e40]
    [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f96e66b78ab]
    [bt] (9) /usr/local/python3/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x50c) [0x7f96e68cb27c]

    terminate called after throwing an instance of 'dmlc::Error'
      what(): [07:48:18] src/./zmq_van.h:121: Reached max retry for bind: Address already in use. errno = 98

    Stack trace returned 10 entries:
    [bt] (0) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2300b) [0x7f96e619000b]
    [bt] (1) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x238b1) [0x7f96e61908b1]
    [bt] (2) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x649f4) [0x7f96e61d19f4]
    [bt] (3) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x4f26d) [0x7f96e61bc26d]
    [bt] (4) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x66779) [0x7f96e61d3779]
    [bt] (5) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x4a1be) [0x7f96e61b71be]
    [bt] (6) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps_server+0xd36) [0x7f96e618e5d6]
    [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f96e66b7e40]
    [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f96e66b78ab]
    [bt] (9) /usr/local/python3/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x50c) [0x7f96e68cb27c]

    Aborted (core dumped)
    Traceback (most recent call last):
      File "/home/byteps_torch/byteps_2/launcher/launch.py", line 220, in <module>
        launch_bps()
      File "/home/byteps_torch/byteps_2/launcher/launch.py", line 216, in launch_bps
        stdout=sys.stdout, stderr=sys.stderr, shell=True)
      File "/usr/local/python3/lib/python3.7/subprocess.py", line 363, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'python -c 'import byteps.server'' returned non-zero exit status 134.

What does "Reached max retry for bind: Address already in use. errno = 98" mean? Thanks.
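For context, errno 98 on Linux is `EADDRINUSE`: the operating system refused a `bind` because another socket already holds that address and port. The sketch below reproduces the same condition with plain Python sockets, independent of BytePS and ZMQ; `bind_twice` is a hypothetical helper written only for this illustration.

```python
import errno
import socket


def bind_twice(host="127.0.0.1"):
    """Bind two sockets to the same port; return the errno of the second bind.

    Returns None if the second bind unexpectedly succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))          # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]  # the port the first socket now holds
        try:
            second.bind((host, port))  # same address and port: should fail
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = bind_twice()
    # Print the numeric errno and its symbolic name for this platform.
    print(code, errno.errorcode.get(code))
```

In the scheduler's case the same error suggests that something on that machine was already bound to the scheduler port when `bpslaunch` started.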

ymjiang commented 3 years ago

Could you check if there is any unkilled process before you launch the task?

lizi998 commented 3 years ago

> Could you check if there is any unkilled process before you launch the task?

There should be no unkilled process before I launch the task. Could you tell me how to check and confirm?
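One way to confirm this from inside the scheduler container is to try binding the scheduler port before launching anything. This is a minimal sketch, assuming the port is 1234 as in the configuration in this thread; `port_is_free` is a hypothetical helper, not a BytePS utility.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use": some process holds the port.
            return False
        return True
```

Calling `port_is_free(1234)` before starting the scheduler returns `False` if a leftover process still holds the port. It does not say which process that is, so a tool that lists the owning PID is still needed to clean it up.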

I just realized that you are the author of the article. Thank you very much for your reply. My WeChat ID is chestnut_man. I look forward to being friends with you.

lizi998 commented 3 years ago

> Could you check if there is any unkilled process before you launch the task?

Yes, you are right. I found and killed a leftover process with [lsof -i:port] -> [kill -9 PID]. Now the scheduler also starts its thread and waits. But the scheduler, server, worker 0 and worker 1 are all just waiting:

server:

    BytePS launching server
    Command: python -c 'import byteps.server'
    [06:47:33] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
    [06:47:33] src/postoffice.cc:25: Creating Van: zmq
    [06:47:33] src/./zmq_van.h:299: Start ZMQ recv thread

worker:

    BytePS launching worker
    [07:38:39] src/postoffice.cc:25: Creating Van: zmq
    [07:38:39] src/./zmq_van.h:299: Start ZMQ recv thread

scheduler:

    BytePS launching scheduler
    Command: python -c 'import byteps.server'
    [09:24:56] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
    [09:24:56] src/postoffice.cc:25: Creating Van: zmq
    [09:24:56] src/./zmq_van.h:299: Start ZMQ recv thread

ymjiang commented 3 years ago

Could you check the network connectivity, i.e., is 10.0.0.1 a reachable address for the other machine?
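A quick way to test this from the containers on the other machine is to attempt a plain TCP connection to the scheduler address while the scheduler is running. This is only a sketch: `can_connect` is a hypothetical helper, and the address 10.0.0.1 and port 1234 are the values from the configuration in this thread.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Running `can_connect("10.0.0.1", 1234)` inside docker-server and docker-worker on machine 2 should return `True` once the scheduler is up. A `False` result points to a routing, firewall or container networking problem rather than to BytePS itself.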