lizi998 opened this issue 3 years ago
Yes, it looks good to me. Did you run into any problems?
With this setup, the server, worker0, and worker1 can create their threads and wait. Like this:

server:

```
BytePS launching server
Command: python -c 'import byteps.server'
[06:47:33] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[06:47:33] src/postoffice.cc:25: Creating Van: zmq
[06:47:33] src/./zmq_van.h:299: Start ZMQ recv thread
```
worker:

```
BytePS launching worker
[07:38:39] src/postoffice.cc:25: Creating Van: zmq
[07:38:39] src/./zmq_van.h:299: Start ZMQ recv thread
```
But the scheduler has a problem:

```
BytePS launching scheduler
Command: python -c 'import byteps.server'
[07:48:18] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[07:48:18] src/postoffice.cc:25: Creating Van: zmq
[07:48:18] 3rdparty/ps-lite/include/dmlc/logging.h:276: [07:48:18] src/./zmq_van.h:121: Reached max retry for bind: Address already in use. errno = 98

Stack trace returned 10 entries:
[bt] (0) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2300b) [0x7f96e619000b]
[bt] (1) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x238b1) [0x7f96e61908b1]
[bt] (2) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x649f4) [0x7f96e61d19f4]
[bt] (3) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x4f26d) [0x7f96e61bc26d]
[bt] (4) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x66779) [0x7f96e61d3779]
[bt] (5) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x4a1be) [0x7f96e61b71be]
[bt] (6) /usr/local/python3/lib/python3.7/site-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps_server+0xd36) [0x7f96e618e5d6]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f96e66b7e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f96e66b78ab]
[bt] (9) /usr/local/python3/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x50c) [0x7f96e68cb27c]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [07:48:18] src/./zmq_van.h:121: Reached max retry for bind: Address already in use. errno = 98

Stack trace returned 10 entries:
[same 10 entries as above]

Aborted (core dumped)
Traceback (most recent call last):
  File "/home/byteps_torch/byteps_2/launcher/launch.py", line 220, in
```
What does `Reached max retry for bind: Address already in use. errno = 98` mean? Thanks.
Could you check if there is any unkilled process before you launch the task?
> Could you check if there is any unkilled process before you launch the task?

There should be no unkilled processes before launching the task. Or could you tell me how to check and confirm?
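For readers hitting the same `errno = 98` error: a minimal sketch for checking whether a stale process still holds the scheduler port. The port value 1234 is the `DMLC_PS_ROOT_PORT` used later in this thread; the `port_busy` helper is purely illustrative, not part of BytePS.

```shell
# Sketch: detect whether a stale (unkilled) process still holds the
# scheduler port. 1234 is the DMLC_PS_ROOT_PORT used in this thread.
PORT=1234

port_busy() {
  # Exit 0 if something is already bound to the given TCP port.
  python3 - "$1" <<'EOF'
import socket, sys
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
try:
    s.bind(("", int(sys.argv[1])))
except OSError:          # errno 98 (EADDRINUSE) lands here
    sys.exit(0)
sys.exit(1)
EOF
}

if port_busy "$PORT"; then
  echo "port $PORT busy: find the owner with 'lsof -i :$PORT' and kill it"
else
  echo "port $PORT free"
fi
```

Once `lsof -i :$PORT` reveals the PID of the leftover scheduler or server, `kill -9 <PID>` frees the port so `bpslaunch` can bind again.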
I suddenly realized that you are the author of the article. Thank you very much for your reply. My WeChat ID is chestnut_man. Looking forward to being friends with you.
> Could you check if there is any unkilled process before you launch the task?
Yes, you are right. I found the PID with `lsof -i:<port>` and killed the stale process with `kill -9 <PID>`. Now the scheduler can create its thread and wait, but the scheduler, server, worker0, and worker1 are all stuck waiting.

server:

```
BytePS launching server
Command: python -c 'import byteps.server'
[06:47:33] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[06:47:33] src/postoffice.cc:25: Creating Van: zmq
[06:47:33] src/./zmq_van.h:299: Start ZMQ recv thread
```

worker:

```
BytePS launching worker
[07:38:39] src/postoffice.cc:25: Creating Van: zmq
[07:38:39] src/./zmq_van.h:299: Start ZMQ recv thread
```

scheduler:

```
BytePS launching scheduler
Command: python -c 'import byteps.server'
[09:24:56] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[09:24:56] src/postoffice.cc:25: Creating Van: zmq
[09:24:56] src/./zmq_van.h:299: Start ZMQ recv thread
```
Could you check the network connectivity, i.e., is 10.0.0.1 a reachable address from the other machine?
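A minimal sketch for that connectivity check, using the 10.0.0.1:1234 values from this thread (adjust to your own scheduler address). The `tcp_check` helper is an illustrative assumption, not a BytePS tool; note that ICMP can be filtered even when the TCP port works, so `ping` alone is not conclusive.

```shell
# Sketch: verify the scheduler address is reachable from the other machine.
SCHED_IP=10.0.0.1
SCHED_PORT=1234

tcp_check() {
  # Exit 0 if a TCP connection to host:port succeeds within 3 seconds.
  python3 - "$1" "$2" <<'EOF'
import socket, sys
try:
    socket.create_connection((sys.argv[1], int(sys.argv[2])), timeout=3).close()
except OSError:
    sys.exit(1)
EOF
}

# ICMP may be filtered even when TCP works, so treat ping as a hint only.
ping -c 1 -W 2 "$SCHED_IP" >/dev/null 2>&1 || echo "ICMP to $SCHED_IP failed (may be filtered)"

if tcp_check "$SCHED_IP" "$SCHED_PORT"; then
  echo "scheduler port reachable"
else
  echo "cannot reach $SCHED_IP:$SCHED_PORT"
fi
```

Run this from the worker and server containers after the scheduler is up; if the TCP check fails, the processes will sit in the "waiting" state seen above.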
Describe the bug

Thanks, and I look forward to your reply. In my Docker container I successfully installed BytePS from source (byteps==0.2.5). Currently I have only 2 machines, each with 8 V100 GPUs. How should I deploy BytePS with 2 machines?
Following the Distributed Training (TCP) section of the Step-by-Step Tutorial, is a setup like the following OK?
1) For the docker-scheduler on machine 1:
```shell
# now you are in the docker-scheduler environment
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1  # the scheduler IP
export DMLC_PS_ROOT_PORT=1234     # the scheduler port
bpslaunch
```
2) For the docker-worker on machine 1:
```shell
# now you are in the docker-worker environment
export NVIDIA_VISIBLE_DEVICES=0,1,2,3
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1  # the scheduler IP
export DMLC_PS_ROOT_PORT=1234     # the scheduler port
bpslaunch python /home/byteps_torch/byteps/example/pytorch/benchmark_byteps.py --model ResNet50 --num-iters 1000000
```
3) For the docker-server on machine 2:
```shell
# now you are in the docker-server environment
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1  # the scheduler IP
export DMLC_PS_ROOT_PORT=1234     # the scheduler port
bpslaunch
```
4) For the docker-worker on machine 2:
```shell
# now you are in the docker-worker environment
export NVIDIA_VISIBLE_DEVICES=0,1,2,3
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1  # the scheduler IP
export DMLC_PS_ROOT_PORT=1234     # the scheduler port
bpslaunch python /home/byteps_torch/byteps/example/pytorch/benchmark_byteps.py --model ResNet50 --num-iters 1000000
```
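Across the four steps above, only `DMLC_ROLE` (and `DMLC_WORKER_ID` on the workers) should differ; every other `DMLC_*` value must be identical in all containers, or the processes will hang waiting for each other. A small sketch (my own helper, not a BytePS command) to dump the variables on each container before `bpslaunch`, so they can be compared:

```shell
# Print the BytePS launch variables so they can be compared across containers.
# Only DMLC_ROLE and DMLC_WORKER_ID should differ between the four setups.
for v in DMLC_ROLE DMLC_WORKER_ID DMLC_NUM_WORKER DMLC_NUM_SERVER \
         DMLC_PS_ROOT_URI DMLC_PS_ROOT_PORT; do
  eval "val=\${$v:-unset}"
  echo "$v=$val"
done
```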