bytedance / byteps

A high performance and generic framework for distributed DNN training

Communication failure in MXNet with BytePS #436

Closed qingyangDuan closed 2 years ago

qingyangDuan commented 2 years ago

Describe the bug

I have installed BytePS for MXNet. But when I try to run distributed training with MXNet, the worker starts fine, while the scheduler and server fail during ps-lite Van setup.

I have some experience with how ps-lite works in MXNet and how to run distributed training with MXNet.

Environment:

Screenshots

I use one worker, one server, and one scheduler. I set BYTEPS_FORCE_DISTRIBUTED to 1, so I think it's okay to use only one worker. My environment settings are below; they should all be correct. I also tried 2 workers and 2 servers later, and the problem is the same.

export DMLC_ROLE=$role
export DMLC_PS_ROOT_URI=192.168.1.101
export DMLC_PS_ROOT_PORT=50677
export PS_VERBOSE=1
export DMLC_NUM_SERVER=1
export DMLC_NUM_WORKER=1
export BYTEPS_LOCAL_RANK=0
export BYTEPS_LOCAL_SIZE=1
export DMLC_WORKER_ID=0
export BYTEPS_FORCE_DISTRIBUTED=1
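A quick way to rule out a plain configuration mistake is to check these variables in each process before it starts. The following is a minimal sketch, not part of BytePS: the helper name `check_env` is hypothetical, and the choice of which variables to treat as required is an assumption taken from the settings listed above.

```python
import os

# Assumption: these names are taken from the settings in this post;
# BytePS itself may require more or fewer variables per role.
REQUIRED = (
    "DMLC_ROLE",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
    "DMLC_NUM_SERVER",
    "DMLC_NUM_WORKER",
)
INTEGER = ("DMLC_PS_ROOT_PORT", "DMLC_NUM_SERVER", "DMLC_NUM_WORKER")
ROLES = ("worker", "server", "scheduler")


def check_env(env):
    """Return a list of human-readable problems found in `env` (a mapping)."""
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in INTEGER:
        value = env.get(name)
        if value and not value.isdigit():
            problems.append(f"{name}={value!r} is not a non-negative integer")
    role = env.get("DMLC_ROLE")
    if role and role not in ROLES:
        problems.append(f"DMLC_ROLE={role!r} is not one of {ROLES}")
    return problems


# Usage: call check_env(os.environ) at the top of each process and
# print or raise on any returned problem.
```

An empty list only means the settings are internally consistent; it says nothing about which ps-lite implementation each process ends up running.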

The following is the worker's output. It seems to be waiting to connect to the scheduler and never finishes. These log messages clearly come from byteps/3rdparty/ps-lite.

INFO:root:Starting new image-classification task:, Namespace(batch_norm=False, batch_size=32, builtin_profiler=1, data_dir='', dataset='cifar10', dtype='float32', epochs=50, gpus='0', iterations=0, kvstore='dist_sync', log_interval=10, lr=0.1, lr_factor=0.1, lr_steps='30,60,90', mode='hybrid', model='vgg16', momentum=0.9, num_workers=4, optimizer='sgd', prefix='', profile=False, resume='', save_frequency=10, seed=123, start_epoch=0, use_pretrained=False, use_thumbnail=False, wd=0.0001)
[15:52:50] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[15:52:50] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[15:52:51] src/postoffice.cc:63: Creating Van: zmq. group_size=1
[15:52:51] src/./zmq_van.h:66: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[15:52:51] src/./zmq_van.h:71: BYTEPS_ZMQ_NTHREADS set to 4
[15:52:51] src/van.cc:581: Bind to [role=worker, ip=192.168.1.102, port=46333, is_recovery=0, aux_id=-1, num_ports=1]
[15:52:51] src/./zmq_van.h:159: Zmq connecting to node [role=scheduler, id=1, ip=192.168.1.101, port=50677, is_recovery=0, aux_id=-1, num_ports=1]. My node is [role=worker, ip=192.168.1.102, port=46333, is_recovery=0, aux_id=-1, num_ports=1]
[15:52:51] src/./zmq_van.h:351: Start ZMQ recv thread

The following is the scheduler's output. The server's output doesn't show this error, but it also prints ------------INITING FPTVan, use priority: 1----------. This line comes from my modifications to mxnet/3rdparty/ps-lite, made before installing BytePS.

------------INITING  FPTVan, use priority: 1----------

[15:50:51] src/van.cc:296: Bind to role=scheduler, id=1, ip=192.168.1.101, port=50677, is_recovery=0
terminate called after throwing an instance of 'dmlc::Error'
  what():  [15:50:51] src/van.cc:472: Check failed: pb.ParseFromArray(meta_buf, buf_size) failed to parse string into protobuf
Stack trace:
  [bt] (0) /home/duanqingyang/incubator-mxnet-1.5/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x34) [0x7f8adfa84334]
  [bt] (1) /home/duanqingyang/incubator-mxnet-1.5/lib/libmxnet.so(ps::Van::UnpackMeta(char const*, int, ps::Meta*)+0x422) [0x7f8ae296bfd2]
  [bt] (2) /home/duanqingyang/incubator-mxnet-1.5/lib/libmxnet.so(ps::ZMQVan::RecvMsg(ps::Message*)+0x3ce) [ 0x7f8ae297367e]
  [bt] (3) /home/duanqingyang/incubator-mxnet-1.5/lib/libmxnet.so(ps::Van::Receiving()+0x2ae) [0x7f8ae296ad5e]
  [bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbda50) [0x7f8b0ced5a50]
  [bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f8b1f0136db]
  [bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f8b1f34c71f]

The first line comes from my modifications to mxnet/3rdparty/ps-lite, made before installing BytePS. So this output means the scheduler and server (implemented in ps-lite) are initialized with the code in mxnet/3rdparty/ps-lite, not the code in byteps/3rdparty/ps-lite. The problem therefore arises because the scheduler and server run the implementation in mxnet/3rdparty/ps-lite, while the worker runs the one in byteps/3rdparty/ps-lite, and the two conflict. I expected the scheduler and server to run the code in byteps/3rdparty/ps-lite as well, but I don't know why they didn't.
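One way to confirm which native library a given process has actually loaded, beyond reading the stack trace, is to inspect its memory map. The sketch below is Linux-only and purely illustrative: the function name `shared_objects` and the keyword list are assumptions, and matching is done on the full path, so a directory name containing a keyword also counts.

```python
def shared_objects(maps_text, keywords=("mxnet", "byteps")):
    """Return sorted paths of mapped shared objects whose path contains a keyword.

    `maps_text` is the content of /proc/<pid>/maps, where each line has the
    form: address perms offset dev inode [pathname].
    """
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].strip()
        is_shared_object = path.endswith(".so") or ".so." in path
        if is_shared_object and any(k in path.lower() for k in keywords):
            paths.add(path)
    return sorted(paths)


# Usage (Linux only), after the imports under test have run in this process:
#     with open("/proc/self/maps") as f:
#         for path in shared_objects(f.read()):
#             print(path)
```

If only libmxnet.so shows up in the scheduler or server process, that is consistent with those roles running MXNet's bundled ps-lite.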

Here is my understanding

For original MXNet without BytePS, the scheduler and server (implemented in ps-lite) are initialized during import mxnet (it imports kvstore_server, which does the initialization), and they use the code in mxnet/3rdparty/ps-lite. I think this initialization pattern doesn't change with BytePS installed, since BytePS doesn't change MXNet's Python files. I've checked the BytePS source code, and it initializes a ps::KVWorker in byteps.mxnet.init(), but I don't know where it initializes ps::KVServer. So I'm not sure how BytePS sets up ps-lite communication.

My questions

Does the python3 setup.py install of BytePS use libps.a (built from byteps/3rdparty/ps-lite) to build a new libmxnet.a? If yes, why doesn't my BytePS installation do this? lol. The server and scheduler still use the old version of ps-lite. If no, how does BytePS initialize its own server and scheduler at the beginning of training and, at the same time, prevent the initialization of the server and scheduler in import mxnet (they may occupy the DMLC_PS_ROOT_PORT)?

If you can answer these questions, I can solve this problem. Thank you very much.

By the way

ymjiang commented 2 years ago

Hi qingyang, here are clarifications for some of your questions.

Does the python3 setup.py install of BytePS replace mxnet/3rdparty/ps-lite with byteps/3rdparty/ps-lite?

No.

If no, how does BytePS initialize its own server and scheduler at the beginning of training, and at the same time, prevent the initialization of server and scheduler in import mxnet (they may occupy the DMLC_PS_ROOT_PORT)?

If you do not intend to use BytePS and MXNet KVStore at the same time, you don't need to worry about this. The BytePS server is initialized through "import byteps.server". You can check the code in byteps/server/__init__.py.


BTW: we do not maintain mxnet > 1.5.0

qingyangDuan commented 2 years ago

OK, I get it. I found "import byteps.server" in launch.py... So I should add it to my training Python file. But I still need to manually stop MXNet from initializing the server and scheduler in "import mxnet". Maybe MXNet 1.5.x is different from 1.5.0. Lol.

qingyangDuan commented 2 years ago

Oh, I get it. On the server and scheduler side, I only need to run "import byteps.server" and nothing else. Then none of the problems mentioned above occur. Thanks again.
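For anyone landing here later, the resolution can be sketched as a small role dispatcher: scheduler and server processes run nothing but "import byteps.server", and only the worker runs the training script. This is an illustrative sketch rather than the actual launch.py; the function name `command_for_role` and the script name `train.py` are hypothetical placeholders.

```python
import sys


def command_for_role(role, train_script="train.py"):
    """Return the argv to execute for a given DMLC_ROLE value.

    `train_script` is a hypothetical placeholder for the worker's
    training script; it is not a file shipped with BytePS.
    """
    if role in ("scheduler", "server"):
        # These roles do not import mxnet or the training script at all.
        return [sys.executable, "-c", "import byteps.server"]
    if role == "worker":
        return [sys.executable, train_script]
    raise ValueError(f"unknown DMLC_ROLE: {role!r}")


# Usage, with the DMLC_* variables already exported in the environment:
#     import os, subprocess
#     subprocess.check_call(command_for_role(os.environ["DMLC_ROLE"]))
```

Keeping the scheduler and server away from the training script means "import mxnet" never runs in those processes, so MXNet's own ps-lite initialization is not triggered there.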