Closed qingyangDuan closed 2 years ago
Hi qingyang, here's the clarification to some of your questions.
Does the python3 setup.py install of BytePS replace mxnet/3rdparty/ps-lite with byteps/3rdparty/ps-lite?
No.
If no, how does BytePS initialize its own server and scheduler at the beginning of training and, at the same time, prevent the initialization of the server and scheduler in import mxnet (they may occupy the DMLC_PS_ROOT_PORT)?
If you do not intend to use BytePS and MXNet KVStore at the same time, you don't need to worry about this. The BytePS server is initialized through "import byteps.server". You can check the code in byteps/server/__init__.py.
BTW: we do not maintain mxnet > 1.5.0
Ok, I get it. I found "import byteps.server" in launch.py... So I should add it to my training Python file. But I still need to manually stop MXNet's initialization of the server and scheduler in "import mxnet". Maybe MXNet 1.5.x is different from 1.5.0. LOL.
Oh, I get it. On the server and scheduler side, I only need to run "import byteps.server" and nothing else. Then I don't have any of the problems mentioned before. Thanks again.
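For readers who land here with the same problem, the conclusion above can be sketched as a small launcher helper. This is only a sketch, not the actual BytePS launch.py: the helper name build_launch is hypothetical, and apart from DMLC_PS_ROOT_PORT (mentioned in this issue) the DMLC_* variable names are assumed from the usual ps-lite convention. The only thing taken from this thread is that the server and scheduler processes run "import byteps.server" and nothing else.

```python
import os
import sys


def build_launch(role, root_uri, root_port, num_worker=1, num_server=1, base_env=None):
    """Return (env, argv) for a BytePS server or scheduler process.

    Hypothetical helper: it only prepares the environment and the command,
    it does not start anything.
    """
    if role not in ("server", "scheduler"):
        raise ValueError("role must be 'server' or 'scheduler', got %r" % (role,))

    # Start from the current environment unless the caller supplies one.
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "DMLC_ROLE": role,                    # assumed ps-lite convention
        "DMLC_PS_ROOT_URI": root_uri,         # assumed ps-lite convention
        "DMLC_PS_ROOT_PORT": str(root_port),  # named in this issue
        "DMLC_NUM_WORKER": str(num_worker),   # assumed ps-lite convention
        "DMLC_NUM_SERVER": str(num_server),   # assumed ps-lite convention
    })

    # The server/scheduler side imports byteps.server and nothing else;
    # in particular it does not import mxnet.
    argv = [sys.executable, "-c", "import byteps.server"]
    return env, argv


# Example use (not executed here):
#   import subprocess
#   env, argv = build_launch("scheduler", "10.0.0.1", 1234)
#   subprocess.Popen(argv, env=env)
```

Because the server and scheduler processes never import mxnet in this setup, the ps-lite code under mxnet/3rdparty is not involved on their side.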
Describe the bug
I have installed BytePS for my MXNet. But when I try to run distributed training with MXNet, the worker works well, while the scheduler and server have problems with the ps-lite Van setup.
I have some experience with how ps-lite works in MXNet and with running distributed training in MXNet.
Environment:
Screenshots
I use one worker, one server and one scheduler. I set BYTEPS_FORCE_DISTRIBUTED to 1, so I think it's okay to use only one worker. The following is my env setting, which should be correct. I also tried 2 workers and 2 servers later, and the problem is the same.
The following is the worker's output: it seems to be waiting to connect to the scheduler, but the connection never completes. Evidently, these log messages come from byteps/3rdparty/ps-lite.
The following is the scheduler's output:
The server's output doesn't have this error, but it also contains the line
------------INITING FPTVan, use priority: 1----------
This line, which is also the first line of the scheduler's output, comes from my modifications to mxnet/3rdparty/ps-lite, made before installing BytePS. So this output means that the scheduler and server (implemented in ps-lite) are initialized with the code in mxnet/3rdparty/ps-lite, not the code in byteps/3rdparty/ps-lite. The problem, then, is that the scheduler and server run the implementation in mxnet/3rdparty/ps-lite while the worker runs the one in byteps/3rdparty/ps-lite, and the two conflict. I expected the scheduler and server to also run the code in byteps/3rdparty/ps-lite, but I don't know why they didn't.
Here is my understanding
For the original MXNet without BytePS, the scheduler and server (implemented in ps-lite) are initialized when doing import mxnet (it does import kvstore_server, which performs the initialization). And of course they use the implementation in mxnet/3rdparty/ps-lite. I think this initialization pattern won't change with BytePS installed, since BytePS doesn't change MXNet's Python files. I've checked the source code of BytePS, and it initializes a ps::KVWorker in byteps.mxnet.init(). But I don't know where it initializes ps::KVServer, so I'm not sure about the ps-lite communication setup of BytePS.
My questions
Does the python3 setup.py install of BytePS use libps.a (built from byteps/3rdparty/ps-lite) to build a new libmxnet.a? If yes, why doesn't my BytePS installation do this? LOL. The server and scheduler still use the old version of ps-lite. If no, how does BytePS initialize its own server and scheduler at the beginning of training and, at the same time, prevent the initialization of the server and scheduler in import mxnet (they may occupy the DMLC_PS_ROOT_PORT)?
If you can answer these questions, then I can solve this problem. Thank you very much.
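One way to test the port concern raised in the question above is to check whether something is already bound to the port before starting the scheduler. The sketch below is a generic diagnostic using only the Python standard library. The helper name is_port_free is hypothetical and is not part of BytePS, MXNet or ps-lite.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port).

    Hypothetical diagnostic helper: a False result means some process,
    possibly an unintended scheduler or server, already holds the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


# Example use (not executed here), with the port read from the environment:
#   import os
#   port = int(os.environ["DMLC_PS_ROOT_PORT"])
#   print("root port free:", is_port_free(port))
```

Running this on the scheduler machine before launch shows whether another process has already taken the root port.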
By the way
This is my training python file: https://github.com/qingyangDuan/Mercury/blob/main/source%20code/image_classification_bps.py I think it utilizes BytePS in the right way.
I installed ByteScheduler for MXNet before installing BytePS. I guess it doesn't affect my analysis.
The following is my modification to setup.py for BytePS, made because it could not find the include headers in mxnet/3rdparty/*. So I added some paths to INCLUDES. I'm not sure whether this causes my problem or not.