ruipeterpan opened this issue 3 years ago
For now, asynchronous mode for PyTorch is supported only with the DistributedOptimizer approach, like this example. Your code currently uses the DDP wrapper, for which we haven't implemented async mode yet.
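Roughly, the DistributedOptimizer approach looks like this (a minimal sketch based on the BytePS PyTorch examples; the model and learning rate here are just placeholders):

import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())

model = torch.nn.Linear(10, 10).cuda()                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)    # placeholder lr

# Wrap the plain optimizer instead of wrapping the model in DDP.
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Broadcast initial parameters and optimizer state from rank 0.
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)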
Hey @ymjiang thanks for the info! Nevertheless, after switching to using benchmark_byteps.py, the issue is still there.
FYI, on the workers I run:
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
and on the server:
root@536f767106da4e48b0f29957b21b64da000000:/# bpslaunch
BytePS launching server
Command: python3 -c 'import byteps.server'
[06:22:03] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[06:22:03] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[06:22:03] src/postoffice.cc:25: Creating Van: zmq
[06:22:03] src/./zmq_van.h:299: Start ZMQ recv thread
[06:22:15] 3rdparty/ps-lite/include/dmlc/logging.h:276: [06:22:15] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fe2ba4d699c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fe2ba4d6ddd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer
terminate called after throwing an instance of 'dmlc::Error' what(): [06:22:15] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fe2ba4d699c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fe2ba4d6ddd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer
Aborted (core dumped)
Traceback (most recent call last):
File "/usr/local/bin/bpslaunch", line 4, in
This error indicates that you did not kill the previous bpslaunch process. @ruipeterpan
pkill bpslaunch; pkill python3
@vycezhong Thanks for your help. I double-checked to make sure no bps-related processes are alive (both inside and outside of all containers) before launching the server, yet it still crashes. Am I doing something dumb?
@ruipeterpan Did you launch any worker in your example above? The server should do nothing but wait when there is no worker.
@vycezhong If only the server gets launched, it starts ZMQ recv thread and waits w/o an error. As soon as the workers are launched, the server crashes.
OK I can reproduce it. I will look at it.
@vycezhong Could this problem be related to https://github.com/bytedance/byteps/pull/225?
My experience is that v0.2.4 worked well (e.g., see https://github.com/bytedance/byteps/issues/271).
Yes. It is because I use update_buf for pulling. I will fix it then.
@ruipeterpan Could you please test if https://github.com/bytedance/byteps/pull/359 fixes your issue?
@vycezhong thanks for the fix! The server-crashing issue is resolved by #359, but I'm seeing some weird behavior for the training loss curve after applying the changes in the PR. I'll spend some time double-checking to make sure it's not a problem on my end.
@ruipeterpan You also need to enable async for workers.
I had already toggled BYTEPS_ENABLE_ASYNC on all workers, servers, and the scheduler when switching between async and sync mode.
@ymjiang I think it should be Parameter here? AsyncParam in servers will be initialized with random values.
I do not get it. The stored buffer will be initialized with the first incoming recv, which contains the value of the parameters.
@ruipeterpan Would you test v0.2.4? My loss curve with v0.2.4 seems fine (it is at least decreasing).
The first incoming recv should be random values.
@ymjiang Here's the loss curve I got for both sync and async using v0.2.4 (809ef20)
@ruipeterpan Please try this commit. https://github.com/bytedance/byteps/pull/359/commits/7ac1dc74335b8935e4ac897e8d92d9c563fdf110
@vycezhong Here's what I got using https://github.com/bytedance/byteps/commit/7ac1dc74335b8935e4ac897e8d92d9c563fdf110 and the original scripts (bps_issue.py) I provided:
Then I commented out a metric_average() call on the training loss after each epoch (this part), and here's what I got:
Let me know if I can help with some more tests and I'll respond ASAP. Thanks for the help!
@ruipeterpan I think you may need to reduce the learning rate for async mode. I am not sure what value is appropriate, but could you try lr/N? N is the number of workers.
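Something roughly like this (a sketch, assuming the byteps.torch API; the model and base_lr are placeholders, and bps.size() is the total number of worker processes):

import torch
import byteps.torch as bps

bps.init()
model = torch.nn.Linear(10, 10)                    # placeholder model
base_lr = 0.05                                     # placeholder for the script's default lr
optimizer = torch.optim.SGD(model.parameters(),
                            lr=base_lr / bps.size())   # i.e. lr/N
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())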
@ymjiang Here's what I got using https://github.com/bytedance/byteps/commit/7ac1dc74335b8935e4ac897e8d92d9c563fdf110. The default is 0.05 and the loss curve was still going up after setting the lr to 0.0125. I also tried out some other learning rates, and in general, the larger the learning rate, the faster the loss curve goes up.
@ymjiang It is because parameter broadcasting also becomes asynchronous. The buffer is initialized with random values as shown in the figure below.
I suggest removing the copy and initializing the buffer with zeros:
memset(stored->tensor, 0, stored->len);
But this did not completely solve the problem. Parameter broadcasting should be synchronous. Right now we rely on the time delay of non-root workers to get things done right, like p.fill_(0).
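To illustrate the pattern I mean (a simplified sketch, not the actual byteps.torch implementation; the model is a placeholder): non-root workers zero their parameters before the broadcast-style push, and we rely on root's values reaching the server before those zeros are pulled back.

import torch
import byteps.torch as bps

bps.init()
model = torch.nn.Linear(10, 10)   # placeholder model

if bps.rank() != 0:
    # Non-root workers zero their parameters before the broadcast.
    with torch.no_grad():
        for p in model.parameters():
            p.fill_(0)            # corresponds to the p.fill_(0) above

bps.broadcast_parameters(model.state_dict(), root_rank=0)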
@ruipeterpan Please try this commit. https://github.com/bytedance/byteps/pull/359/commits/18699f8932e404c0a8c97f847c1c06e0b4ec1fdf
@vycezhong Here's what I got using https://github.com/bytedance/byteps/commit/18699f8932e404c0a8c97f847c1c06e0b4ec1fdf with 4 workers + 1 server. I don't know if this is related, but I should note that in the first epoch of the first run, worker 0 got a loss of ~9 while all other workers got ~2.3. In subsequent runs of the scripts this issue was gone, and the following screenshots are from those subsequent runs.
Here's what I got using https://github.com/bytedance/byteps/commit/18699f8932e404c0a8c97f847c1c06e0b4ec1fdf with 4 workers + 4 servers.
Thank you!
Hi, I encountered the same problem again in the current version of BPS.
[05:05:15] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[05:05:15] byteps/server/server.cc:430: BytePS server engine uses 8 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[05:05:15] src/postoffice.cc:25: Creating Van: zmq
[05:05:15] src/./zmq_van.h:66: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[05:05:15] src/./zmq_van.h:71: BYTEPS_ZMQ_NTHREADS set to 4
[05:05:15] src/van.cc:441: Bind to [role=server, ip=172.31.41.74, port=38517, is_recovery=0, aux_id=-1]
[05:05:15] src/./zmq_van.h:299: Start ZMQ recv thread
[05:05:29] src/van.cc:387: S[8] is connected to others
[05:05:29] 3rdparty/ps-lite/include/dmlc/logging.h:276: [05:05:29] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2268c) [0x7fa77e84d68c]
[bt] (1) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x22acd) [0x7fa77e84dacd]
[bt] (2) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fa77e846ea2]
[bt] (3) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fa77e848fc2]
[bt] (4) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2472e) [0x7fa77e84f72e]
[bt] (5) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x41491) [0x7fa77e86c491]
[bt] (6) /opt/conda/lib/libstdc++.so.6(+0xc9039) [0x7fa77e77f039]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa77ee116db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa77eb3a61f]
terminate called after throwing an instance of 'dmlc::Error'
what(): [05:05:29] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2268c) [0x7fa77e84d68c]
[bt] (1) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x22acd) [0x7fa77e84dacd]
[bt] (2) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fa77e846ea2]
[bt] (3) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fa77e848fc2]
[bt] (4) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2472e) [0x7fa77e84f72e]
[bt] (5) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x41491) [0x7fa77e86c491]
[bt] (6) /opt/conda/lib/libstdc++.so.6(+0xc9039) [0x7fa77e77f039]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa77ee116db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa77eb3a61f]
Aborted (core dumped)
Traceback (most recent call last):
File "/opt/conda/bin/bpslaunch", line 253, in <module>
launch_bps()
File "/opt/conda/bin/bpslaunch", line 249, in launch_bps
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.
I also use the example provided here. FYI, the bps setup:
Scheduler:
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch
Server
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export BYTEPS_SERVER_ENGINE_THREAD=8
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch
Worker 0
export NVIDIA_VISIBLE_DEVICES=0,1,2,3
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch python3 benchmark_byteps.py
Worker 1
export NVIDIA_VISIBLE_DEVICES=0,1,2,3
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch python3 benchmark_byteps.py
I can run without this bug in synchronous mode, and other synchronous training code also works fine. Is there a new bug related to asynchronous training? I should also mention that there is a bug in the sample code right now.
_init__.py", line 398, in broadcast_optimizer_state
76: Stopping W[9]
p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
[05:17:18] src/van.cc:104: W[9] is stopped
p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
[05:17:18] src/./zmq_van.h:81: W all threads joined and destroyed
Traceback (most recent call last):
File "/opt/conda/bin/bpslaunch", line 253, in <module>
launch_bps()
File "/opt/conda/bin/bpslaunch", line 239, in launch_bps
t[i].join()
File "/opt/conda/bin/bpslaunch", line 34, in join
raise self.exc
File "/opt/conda/bin/bpslaunch", line 27, in run
self.ret = self._target(*self._args, **self._kwargs)
File "/opt/conda/bin/bpslaunch", line 193, in worker
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
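For what it's worth, the TypeError itself reproduces in isolation; this is what the sample code's line does when an optimizer hyperparameter happens to be None:

import torch

p = None                      # a hyperparameter value that is None
p = torch.Tensor([p]).cuda()  # raises: TypeError: must be real number, not NoneType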
Thanks for your attention.
Describe the bug
Turning on asynchronous training (export BYTEPS_ENABLE_ASYNC=1) crashes the bps server (during SendPullResponse in byteps/server/server.cc).
Expected behavior
The expected behavior is for the training to run error-free, just like in synchronous training.
Stack trace from the crashed server
These are produced by turning on BYTEPS_ENABLE_GDB, setting BYTEPS_LOG_LEVEL to INFO, and setting PS_VERBOSE to 2.
To Reproduce
Steps to reproduce the behavior:
building the docker image
byteps setup
Environment (please complete the following information):
A few other things