flynnamy opened this issue 5 years ago
Can you please provide the complete cmds that you use? We need to make sure you have correct configurations (e.g., scheduler ip and port, etc).
I followed your docs exactly; my commands are as follows:

1. server.sh:
```
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.5.37.72
export DMLC_PS_ROOT_PORT=1234
python /home/fws/byteps/launcher/launch.py
```
2. worker0.sh:
```
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.5.37.72
export DMLC_PS_ROOT_PORT=1234
export EVAL_TYPE=benchmark
python /home/fws/byteps/launcher/launch.py /home/fws/byteps/example/pytorch/start_pytorch_byteps.sh --model vgg16 --num-iters 100 --batch-size 64
```
3. worker1.sh:
```
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.5.37.72
export DMLC_PS_ROOT_PORT=1234
export EVAL_TYPE=benchmark
python /home/fws/byteps/launcher/launch.py /home/fws/byteps/example/pytorch/start_pytorch_byteps.sh --model vgg16 --num-iters 100 --batch-size 64
```
On 10.5.37.72 I run `bash scheduler.sh` and `bash server.sh`. On 10.5.37.73 I run `bash worker0.sh`, and on 10.5.37.74 I run `bash worker1.sh`.
On the other hand, everything runs well on a single machine and the results are correct. So is there a problem with my commands? Thanks for your reply!
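For reference, scheduler.sh itself is not shown in this thread. A minimal sketch, assuming it mirrors server.sh with only DMLC_ROLE switched (the standard BytePS/DMLC convention), would be:

```shell
# scheduler.sh -- a sketch, not the user's actual script; assumes the same
# DMLC settings as server.sh above, with the role changed to scheduler.
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.5.37.72   # the scheduler's own IP
export DMLC_PS_ROOT_PORT=1234
python /home/fws/byteps/launcher/launch.py
```

The scheduler must use its own IP/port in DMLC_PS_ROOT_URI/PORT, and every other node must point at that same address.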
From your command, I guess you did not use our docker images. I have a few questions --
For server and scheduler, did you build our modified mxnet, as explained in README.md?
Are you sure that both of your workers can connect to 10.5.37.72:1234?
When you said "execute well in single machine, results are right", do you mean you ran the server, scheduler, and workers on a single machine, or do you mean you just ran a worker on a single machine?
Can you set PS_VERBOSE=2 for all workers, scheduler and server? Then paste the log output here. Thank you.
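Question 2 above (can the workers reach 10.5.37.72:1234?) can be checked without BytePS at all. A small sketch using bash's `/dev/tcp` redirection (the host and port come from the thread; the helper name is mine):

```shell
# check_port: exit 0 iff a TCP connection to $1:$2 succeeds within 2 seconds.
# Uses bash's /dev/tcp pseudo-device, so it needs bash, not plain sh.
check_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Run this on each worker node against the scheduler endpoint.
if check_port "${SCHED_HOST:-10.5.37.72}" "${SCHED_PORT:-1234}"; then
  echo "scheduler reachable"
else
  echo "scheduler NOT reachable: check firewall / DMLC_PS_ROOT_URI"
fi
```

If this prints "NOT reachable" on either worker, the ADD_NODE handshake in the logs below can never complete and the workers will hang exactly as described.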
Yes, I do not use the docker images; I run on our own cluster. In my opinion it does not depend on nvidia-docker, right? Or does it require nvidia-docker?
1. I run the pytorch scripts; I did not build your modified mxnet.
2. Yes, it can; I have tested horovod's distributed scripts.
3. I just ran a worker on a single machine.
4. I set PS_VERBOSE=2; its output:
On the 73 node:
```
BytePS launching worker
running benchmark...
[14:01:04] src/customer.cc:363: Do not use thread pool for receiving.
[14:01:04] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[14:01:04] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[14:01:04] src/./zmq_van.h:285: Start ZMQ recv thread
[14:01:04] src/van.cc:357: Bind to role=worker, ip=10.5.37.73, port=43871, is_recovery=0
[14:01:04] src/van.cc:446: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.5.37.73, port=43871, is_recovery=0 } }. THIS IS NOT DATA MSG!
```
On the 74 node:
```
BytePS launching worker
running benchmark...
[14:01:03] src/customer.cc:363: Do not use thread pool for receiving.
[14:01:03] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[14:01:03] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[14:01:03] src/./zmq_van.h:285: Start ZMQ recv thread
[14:01:03] src/van.cc:357: Bind to role=worker, ip=10.5.37.74, port=42934, is_recovery=0
[14:01:03] src/van.cc:446: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.5.37.74, port=42934, is_recovery=0 } }. THIS IS NOT DATA MSG!
```
@flynnamy Please see README.md
...Otherwise, you have to manually compile our modified MXNet as in our Dockerfile.
If you don't use our docker image, you must compile our modified MXNet. The server part in BytePS is modified from MXNet. Even if you are running PyTorch or TF workers, you still need our MXNet as the server. You can use the same commands here https://github.com/bytedance/byteps/blob/master/docker/Dockerfile.server#L75
We highly recommend that you just use our docker image bytepsimage/byteps_server for server and scheduler. It does not need nvidia-docker, since it does not need GPUs. You can run it with plain docker.
```
docker run -it --net=host bytepsimage/byteps_server bash
```
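The container still needs the DMLC settings from the scripts above. One way is to pass them with `-e` flags; this sketch just assembles and prints the command (the image name and `--net=host` come from the thread, the particular env list mirrors the server.sh above and is my assumption):

```shell
# Build a docker command that carries the DMLC configuration into the
# server container. Printed rather than executed, so it can be reviewed.
DOCKER_CMD="docker run -it --net=host \
  -e DMLC_ROLE=server -e DMLC_NUM_WORKER=2 -e DMLC_NUM_SERVER=1 \
  -e DMLC_PS_ROOT_URI=10.5.37.72 -e DMLC_PS_ROOT_PORT=1234 \
  bytepsimage/byteps_server bash"
echo "$DOCKER_CMD"
```

Alternatively, start the container with the plain command above and `export` the variables inside it before launching, as the tutorial does.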
@flynnamy We have provided a pypi source for server / scheduler. If you don't use docker then you can try to install from pypi. Refer to https://github.com/bytedance/byteps/blob/master/docs/pip-list.md
@ymjiang Hi, I have a problem when installing byteps_pytorch. Can I install official pytorch instead of byteps_pytorch? Both byteps_pytorch wheels fail:
```
AssertionError: byteps-pytorch1.0.1-cu90==0.1.0 .dist-info directory not found
AssertionError: byteps-pytorch1.1.0-cu90==0.1.0 .dist-info directory not found
```
@flynnamy Please read the instructions.
```
wget -O byteps-0.1.0-cp27-none-any.whl YOUR_WHEEL_URL
python -m pip install --index-url https://test.pypi.org/simple/ --no-deps byteps-0.1.0-cp27-none-any.whl
```
Did you miss the first step?
If you have already downloaded the wheel package, rename it to byteps-0.1.0-cp27-none-any.whl
and try again.
This step is OK; it shows `Requirement already satisfied: byteps==0.1.0`, but byteps-pytorch still has problems.
It is OK now; I did it in two steps.
OK. Glad that it works.
When I run the benchmark, it shows `ImportError: No module named torch.backends.cudnn`. Do I need to modify the code when I use byteps-0.1.0?
Are you running our benchmark scripts? If yes, then there is no need to modify the code. Can you make sure your pytorch is properly installed? The error seems unrelated to byteps at all.
My commands are:
```
wget -O byteps-0.1.0-cp27-none-any.whl https://test-files.pythonhosted.org/packages/db/6f/c99266a52e71d4df875fdf3ff3fa073b98424ea0a7182a0237b1930d34be/byteps_pytorch1.1.0_cu90-0.1.0-cp27-none-any.whl
python -m pip install --index-url https://test.pypi.org/simple/ --no-deps byteps-0.1.0-cp27-none-any.whl
```
It shows:
```
Processing ./byteps-0.1.0-cp27-none-any.whl
Installing collected packages: byteps
Successfully installed byteps-0.1.0
```
So does that mean pytorch 1.1 has been installed? Or what else do I need to do?
No, your two commands only install byteps. You need to install pytorch before them.
Please read the README: BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet.
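The README statement above can be checked mechanically before installing the byteps wheel. A small sketch (the messages and variable name are mine; `python3` is used here, though this thread itself was on python2):

```shell
# Sanity check: BytePS assumes TensorFlow / PyTorch / MXNet is already
# installed, so verify at least one is importable before installing byteps.
if python3 -c "import torch" 2>/dev/null; then
  status="pytorch found"
elif python3 -c "import tensorflow" 2>/dev/null; then
  status="tensorflow found"
elif python3 -c "import mxnet" 2>/dev/null; then
  status="mxnet found"
else
  status="no framework found -- install one before installing byteps"
fi
echo "$status"
```

If this reports no framework, the `ImportError: No module named torch.backends.cudnn` above is expected: the byteps wheel does not pull in pytorch as a dependency.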
I made a mistake; I had guessed that you wrapped them.
Could you support a python3 version?
Did you come across any errors? The pytorch example in our tutorial can run with python3.
I see that your byteps_server is a py2 wheel, right? I think it needs a python3 version.
I see. We will release python3 wheel. For now if you want to quickly try out BytePS, please follow the tutorial (using docker). That will save you a lot of time dealing with installation and environments. https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md
BTW, does it run into errors if you (force) use python3 to launch?
@flynnamy Do you still have problems besides the python3 request? Does python2 work for you for now?
python2 works for me, but I have not solved distributed training without docker.
@flynnamy So you can run distributed training in docker now?
What is the problem without docker?
It is not convenient to use docker in our cluster, so I want to do distributed training without docker. I think I should compile the modified mxnet. On the other hand, I pip installed byteps_server-1.5.0-py2-none-any.whl, so I do not need bytepsimage/byteps_server, right?
Right. If you can find the right .whl version for your environment, you do not need docker.
Basically, you need just one of the three following options:
```
python setup.py install
```
for workers, and compile the modified MXNet for server/scheduler.

I chose the second option. It has the same problems. On the 73 node:
```
BytePS launching worker
running benchmark...
[14:01:04] src/customer.cc:363: Do not use thread pool for receiving.
[14:01:04] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[14:01:04] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[14:01:04] src/./zmq_van.h:285: Start ZMQ recv thread
[14:01:04] src/van.cc:357: Bind to role=worker, ip=10.5.37.73, port=43871, is_recovery=0
[14:01:04] src/van.cc:446: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.5.37.73, port=43871, is_recovery=0 } }. THIS IS NOT DATA MSG!
```
On the 74 node:
```
BytePS launching worker
running benchmark...
[14:01:03] src/customer.cc:363: Do not use thread pool for receiving.
[14:01:03] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[14:01:03] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[14:01:03] src/./zmq_van.h:285: Start ZMQ recv thread
[14:01:03] src/van.cc:357: Bind to role=worker, ip=10.5.37.74, port=42934, is_recovery=0
[14:01:03] src/van.cc:446: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.5.37.74, port=42934, is_recovery=0 } }. THIS IS NOT DATA MSG!
```
How can I solve this?
@flynnamy Did you launch the server and scheduler? If you did, can you also show the log output of them? If not, please follow the instructions here: https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#distributed-training
Yes, I launched the server and scheduler on the 72 node first, then worker0 on the 73 node and worker1 on the 74 node. But they hang. The output of the 73 and 74 nodes is the same as shown above.
On the 72 node (scheduler and server), it shows:
```
BytePS launching scheduler
BytePS launching server
```
No further information. Is there a way to get more information, or to see what should happen after src/van.cc:446?
The scheduler and server node should output more, especially with PS_VERBOSE=2.
Did you set BYTEPS_SERVER_MXNET_PATH, and if so, what is it?
https://github.com/bytedance/byteps/blob/master/launcher/launch.py#L44
This must point to the path where you installed our modified MXNet. Otherwise it will just import regular MXNet, which is not compatible with BytePS workers.
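If the server/scheduler is launched from a source build rather than the wheel, this would look something like the following (the path below is a hypothetical placeholder, not from the thread; substitute wherever your modified MXNet's python package actually lives):

```shell
# Point launch.py at the modified MXNet build before starting server/scheduler.
export BYTEPS_SERVER_MXNET_PATH=/path/to/modified/mxnet/python   # hypothetical path
python /home/fws/byteps/launcher/launch.py
```

If the variable is unset, launch.py falls back to whatever `import mxnet` finds on the default python path, which here was a regular MXNet.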
I pip installed the scheduler & server wheel and did not set BYTEPS_SERVER_MXNET_PATH. So it should have installed your modified mxnet as the server, and I do not need to compile your modified mxnet, right? Or should I set BYTEPS_SERVER_MXNET_PATH?
@flynnamy We just discovered the server & scheduler wheel does not work as expected. We are sorry about this and will fix this soon.
@flynnamy The server & scheduler wheel is now available at: (including how to use them) https://github.com/bytedance/byteps/blob/master/docs/pip-list.md#server--scheduler
I have three nodes:
1. first node: scheduler and server
2. second node: worker0
3. third node: worker1

The problem is that the worker nodes hang. The first node shows:
```
BytePS launching scheduler
BytePS launching server
```
The second and third nodes have the same problem as above. So please tell me how to solve this problem, thanks.
Best regards