bytedance / byteps

A high performance and generic framework for distributed DNN training
3.63k stars · 488 forks

distributed benchmark has problems #33

Open flynnamy opened 5 years ago

flynnamy commented 5 years ago

I have three nodes. On the first node I run the scheduler and the server; on the second node I run worker0; on the third node I run worker1. The problem is that the worker nodes hang. The first node only shows:

BytePS launching scheduler
BytePS launching server

The second and third nodes hang in the same way (see attached screenshots). Please tell me how to solve this problem, thanks.

Best regards

ymjiang commented 5 years ago

Can you please provide the complete commands that you used? We need to make sure you have the correct configuration (e.g., scheduler IP and port).

flynnamy commented 5 years ago

I followed your docs completely. The commands are as follows:

  1. scheduler.sh:

export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.5.37.72
export DMLC_PS_ROOT_PORT=1234
python /home/fws/byteps/launcher/launch.py

server.sh:

export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.5.37.72
export DMLC_PS_ROOT_PORT=1234
python /home/fws/byteps/launcher/launch.py

  2. worker0.sh:

export NVIDIA_VISIBLE_DEVICES=0
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.5.37.72
export DMLC_PS_ROOT_PORT=1234
export EVAL_TYPE=benchmark
python /home/fws/byteps/launcher/launch.py /home/fws/byteps/example/pytorch/start_pytorch_byteps.sh --model vgg16 --num-iters 100 --batch-size 64

  3. worker1.sh:

export NVIDIA_VISIBLE_DEVICES=0
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.5.37.72
export DMLC_PS_ROOT_PORT=1234
export EVAL_TYPE=benchmark
python /home/fws/byteps/launcher/launch.py /home/fws/byteps/example/pytorch/start_pytorch_byteps.sh --model vgg16 --num-iters 100 --batch-size 64

On 10.5.37.72, I run bash scheduler.sh and bash server.sh. On 10.5.37.73, I run bash worker0.sh. On 10.5.37.74, I run bash worker1.sh.

Otherwise, everything runs well on a single machine and the results are correct. So do my commands have problems? Thanks for your reply!
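As a side note for anyone reproducing this setup: the role-specific DMLC_* variables in the scripts above can be sanity-checked before launch. The helper below is a hypothetical sketch (not part of BytePS); only the variable names are taken from the scripts.

```python
# Variables every BytePS process needs (names from the scripts above).
# DMLC_WORKER_ID is additionally required for workers.
# NOTE: this checker is a hypothetical helper, not part of BytePS.
REQUIRED = ["DMLC_ROLE", "DMLC_NUM_WORKER", "DMLC_NUM_SERVER",
            "DMLC_PS_ROOT_URI", "DMLC_PS_ROOT_PORT"]

def check_dmlc_env(env):
    """Return a list of problems found in a DMLC_* environment mapping."""
    problems = ["missing " + k for k in REQUIRED if k not in env]
    if env.get("DMLC_ROLE") == "worker" and "DMLC_WORKER_ID" not in env:
        problems.append("worker is missing DMLC_WORKER_ID")
    return problems

# The worker0 environment from the scripts above passes the check:
worker0 = {
    "DMLC_ROLE": "worker", "DMLC_WORKER_ID": "0",
    "DMLC_NUM_WORKER": "2", "DMLC_NUM_SERVER": "1",
    "DMLC_PS_ROOT_URI": "10.5.37.72", "DMLC_PS_ROOT_PORT": "1234",
}
assert check_dmlc_env(worker0) == []
```

All three roles must agree on DMLC_PS_ROOT_URI/DMLC_PS_ROOT_PORT and on the worker/server counts, which is why checking each script's environment against a single list is useful.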

bobzhuyb commented 5 years ago

From your command, I guess you did not use our docker images. I have a few questions --

  1. For server and scheduler, did you build our modified mxnet, as explained in README.md?

  2. Are you sure that both of your workers can connect to 10.5.37.72:1234?

  3. When you said "it executes well on a single machine and the results are right", do you mean you run the server, scheduler, and workers on a single machine, or that you just run a worker on a single machine?

  4. Can you set PS_VERBOSE=2 for all workers, scheduler and server? Then paste the log output here. Thank you.
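Regarding question 2, reachability of the scheduler endpoint can be tested independently of BytePS with a plain TCP connect. This is a generic sketch, not a BytePS utility:

```python
import socket

def reachable(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From each worker node, e.g.: reachable("10.5.37.72", 1234)
```

If this returns False from a worker node while the scheduler is running, a firewall or routing problem is the likely cause of the hang.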

flynnamy commented 5 years ago

Yes, I do not use the docker images; I run on our own cluster. In my opinion it does not depend on nvidia-docker, right? Or is nvidia-docker required?

  1. I run the PyTorch scripts; I did not build your modified MXNet.
  2. They can; I have tested distributed Horovod scripts.
  3. I just run a worker on a single machine.
  4. I set PS_VERBOSE=2; its output:

flynnamy commented 5 years ago


flynnamy commented 5 years ago

On the 73 node:

BytePS launching worker
running benchmark...
[14:01:04] src/customer.cc:363: Do not use thread pool for receiving.
[14:01:04] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[14:01:04] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[14:01:04] src/./zmq_van.h:285: Start ZMQ recv thread
[14:01:04] src/van.cc:357: Bind to role=worker, ip=10.5.37.73, port=43871, is_recovery=0
[14:01:04] src/van.cc:446: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.5.37.73, port=43871, is_recovery=0 } }. THIS IS NOT DATA MSG!

On the 74 node:

BytePS launching worker
running benchmark...
[14:01:03] src/customer.cc:363: Do not use thread pool for receiving.
[14:01:03] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[14:01:03] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[14:01:03] src/./zmq_van.h:285: Start ZMQ recv thread
[14:01:03] src/van.cc:357: Bind to role=worker, ip=10.5.37.74, port=42934, is_recovery=0
[14:01:03] src/van.cc:446: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.5.37.74, port=42934, is_recovery=0 } }. THIS IS NOT DATA MSG!

bobzhuyb commented 5 years ago

@flynnamy Please see README.md

...Otherwise, you have to manually compile our modified MXNet as in our Dockerfile.

If you don't use our docker image, you must compile our modified MXNet. The server part of BytePS is modified from MXNet. Even if you are running PyTorch or TF workers, you still need our MXNet as the server. You can use the same commands as in https://github.com/bytedance/byteps/blob/master/docker/Dockerfile.server#L75

We highly recommend that you just use our docker image bytepsimage/byteps_server for the server and scheduler. It does not need nvidia-docker, since it does not need GPUs. You can run it with plain docker:

docker run -it --net=host bytepsimage/byteps_server bash
ymjiang commented 5 years ago

@flynnamy We have provided a pypi source for server / scheduler. If you don't use docker then you can try to install from pypi. Refer to https://github.com/bytedance/byteps/blob/master/docs/pip-list.md

flynnamy commented 5 years ago

@ymjiang Hi, I have a problem when I install byteps_pytorch. Can I install the official pytorch instead of byteps_pytorch? Both byteps_pytorch wheels have problems:

AssertionError: byteps-pytorch1.0.1-cu90==0.1.0 .dist-info directory not found
AssertionError: byteps-pytorch1.1.0-cu90==0.1.0 .dist-info directory not found

ymjiang commented 5 years ago

@flynnamy Please read the instructions.

wget -O byteps-0.1.0-cp27-none-any.whl YOUR_WHEEL_URL
python -m pip install --index-url https://test.pypi.org/simple/ --no-deps byteps-0.1.0-cp27-none-any.whl

Did you miss the first step?

If you have already downloaded the wheel package, rename it to byteps-0.1.0-cp27-none-any.whl and try again.

flynnamy commented 5 years ago

That step is OK. It shows Requirement already satisfied: byteps==0.1.0, but byteps-pytorch has problems.

flynnamy commented 5 years ago

It is OK now; I did it in two steps.

ymjiang commented 5 years ago

OK. Glad that it works.

flynnamy commented 5 years ago

When I run the benchmark, it shows ImportError: No module named torch.backends.cudnn. Do I need to modify the code when I use byteps-0.1.0?

ymjiang commented 5 years ago

Are you running our benchmark scripts? If yes, then there is no need to modify the code. Can you make sure your pytorch is properly installed? The error does not seem related to byteps at all.

flynnamy commented 5 years ago

My commands are like this:

1. wget -O byteps-0.1.0-cp27-none-any.whl https://test-files.pythonhosted.org/packages/db/6f/c99266a52e71d4df875fdf3ff3fa073b98424ea0a7182a0237b1930d34be/byteps_pytorch1.1.0_cu90-0.1.0-cp27-none-any.whl
2. python -m pip install --index-url https://test.pypi.org/simple/ --no-deps byteps-0.1.0-cp27-none-any.whl

It shows:

Processing ./byteps-0.1.0-cp27-none-any.whl
Installing collected packages: byteps
Successfully installed byteps-0.1.0

So does that mean pytorch 1.1 has been installed? Or what else do I need to do?

ymjiang commented 5 years ago

No, your two commands only install byteps. You need to install pytorch before them.

ymjiang commented 5 years ago

Please read the README: BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet.
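A quick way to verify that prerequisite before installing the BytePS wheels is to check which frameworks are importable. The helper below is made up for illustration and uses only the standard library:

```python
import importlib.util

# BytePS wraps one of these frameworks; at least one must already be
# installed. (Hypothetical helper, standard library only.)
FRAMEWORKS = ("tensorflow", "torch", "mxnet")

def installed_frameworks():
    """Return the subset of FRAMEWORKS that can be imported."""
    return [name for name in FRAMEWORKS
            if importlib.util.find_spec(name) is not None]
```

If the list comes back empty, install PyTorch (or TensorFlow/MXNet) first, and only then the byteps wheel.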

flynnamy commented 5 years ago

I made a mistake; I guessed you had wrapped them.

flynnamy commented 5 years ago

Could you support a python3 version?

ymjiang commented 5 years ago

Did you run into any issue? The pytorch example in our tutorial can run with python3.

flynnamy commented 5 years ago

I see your byteps_server is a py2 wheel, right? I think it needs a python3 version.

ymjiang commented 5 years ago

I see. We will release a python3 wheel. For now, if you want to quickly try out BytePS, please follow the tutorial (using docker). That will save you a lot of time dealing with installation and environments: https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md

BTW, does it run into errors if you (force) use python3 to launch?

bobzhuyb commented 5 years ago

@flynnamy Do you still have problems besides the python3 request? Does python2 work for you for now?

flynnamy commented 5 years ago

python2 works for me. I have not solved distributed training without docker yet.

bobzhuyb commented 5 years ago

@flynnamy So you can run distributed training in docker now?

What is the problem without docker?

flynnamy commented 5 years ago

It is not convenient to use docker in our cluster, so I want to do distributed training without docker. I think I should compile the modified MXNet. On the other hand, I pip installed byteps_server-1.5.0-py2-none-any.whl, so I do not need bytepsimage/byteps_server, right?

bobzhuyb commented 5 years ago

Right. If you can find the right .whl version for your environment, you do not need docker.

Basically, you need just one of the three following options:

  1. use docker images
  2. install two .whl, one for worker (if you can find the version you need), and one for server/scheduler
  3. Build from source, python setup.py install for workers, and compile the modified MXNet for server/scheduler
flynnamy commented 5 years ago

I chose the second option. It has the same problems.

On the 73 node:

BytePS launching worker
running benchmark...
[14:01:04] src/customer.cc:363: Do not use thread pool for receiving.
[14:01:04] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[14:01:04] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[14:01:04] src/./zmq_van.h:285: Start ZMQ recv thread
[14:01:04] src/van.cc:357: Bind to role=worker, ip=10.5.37.73, port=43871, is_recovery=0
[14:01:04] src/van.cc:446: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.5.37.73, port=43871, is_recovery=0 } }. THIS IS NOT DATA MSG!

On the 74 node:

BytePS launching worker
running benchmark...
[14:01:03] src/customer.cc:363: Do not use thread pool for receiving.
[14:01:03] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[14:01:03] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[14:01:03] src/./zmq_van.h:285: Start ZMQ recv thread
[14:01:03] src/van.cc:357: Bind to role=worker, ip=10.5.37.74, port=42934, is_recovery=0
[14:01:03] src/van.cc:446: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.5.37.74, port=42934, is_recovery=0 } }. THIS IS NOT DATA MSG!

How can I solve this?

ymjiang commented 5 years ago

@flynnamy Did you launch the server and scheduler? If you did, can you also show the log output of them? If not, please follow the instructions here: https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#distributed-training

flynnamy commented 5 years ago

Yes, I launched the server and scheduler on the 72 node first, then worker0 on the 73 node and worker1 on the 74 node. But they hang. The output of the 73 and 74 nodes is the same as shown above.

On the 72 node (scheduler and server), it only shows:

BytePS launching scheduler
BytePS launching server

No further information. So is there a way to get more information, or what is supposed to happen after src/van.cc:446?

bobzhuyb commented 5 years ago

The scheduler and server node should output more, especially with PS_VERBOSE=2.

Did you set BYTEPS_SERVER_MXNET_PATH, and if so, what is it?

https://github.com/bytedance/byteps/blob/master/launcher/launch.py#L44 This must point to the path where you installed our modified MXNet. Otherwise, it will just import the regular MXNet, which is not compatible with BytePS workers.
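The mechanism behind such a path override can be sketched in a few lines: prepend the directory to sys.path so a later import mxnet resolves to the modified build rather than any system-wide installation. This is a simplified illustration of the idea, not the actual launcher code:

```python
import os
import sys

def prefer_modified_mxnet():
    """If BYTEPS_SERVER_MXNET_PATH is set, make it win the import search.

    Sketch only: the real logic lives in launcher/launch.py.
    """
    path = os.environ.get("BYTEPS_SERVER_MXNET_PATH")
    if path:
        # Front of sys.path means a later `import mxnet` finds this copy first.
        sys.path.insert(0, path)
    return path
```

If the variable is unset, nothing is prepended and Python falls back to whatever mxnet is on the default path, which matches the incompatibility described above.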

flynnamy commented 5 years ago

I pip installed the scheduler & server wheel and did not set BYTEPS_SERVER_MXNET_PATH. So it should have installed your modified MXNet as the server, and I do not need to compile your modified MXNet, right? Or should I set BYTEPS_SERVER_MXNET_PATH?

ymjiang commented 5 years ago

@flynnamy We just discovered that the server & scheduler wheel does not work as expected. We are sorry about this and will fix it soon.

ymjiang commented 5 years ago

@flynnamy The server & scheduler wheel is now available (including instructions on how to use it) at: https://github.com/bytedance/byteps/blob/master/docs/pip-list.md#server--scheduler