ruipeterpan opened this issue 3 years ago
For now, asynchronous mode for PyTorch is supported only with the DistributedOptimizer approach, like this example. Your code currently uses the DDP wrapper, for which we haven't implemented async mode yet.
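Roughly, the DistributedOptimizer approach looks like this (a minimal sketch based on the BytePS PyTorch examples; the model and learning rate here are just placeholders):

import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())

model = torch.nn.Linear(10, 10).cuda()                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)    # placeholder lr

# Wrap the plain optimizer instead of wrapping the model in DDP.
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Broadcast initial parameters and optimizer state from rank 0.
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)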
Hey @ymjiang thanks for the info! Nevertheless, after switching to using benchmark_byteps.py, the issue is still there.
FYI, on the workers I run:
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
and on the server:
root@536f767106da4e48b0f29957b21b64da000000:/# bpslaunch
BytePS launching server
Command: python3 -c 'import byteps.server'
[06:22:03] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[06:22:03] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[06:22:03] src/postoffice.cc:25: Creating Van: zmq
[06:22:03] src/./zmq_van.h:299: Start ZMQ recv thread
[06:22:15] 3rdparty/ps-lite/include/dmlc/logging.h:276: [06:22:15] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fe2ba4d699c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fe2ba4d6ddd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer
terminate called after throwing an instance of 'dmlc::Error' what(): [06:22:15] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fe2ba4d699c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fe2ba4d6ddd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer
Aborted (core dumped)
Traceback (most recent call last):
File "/usr/local/bin/bpslaunch", line 4, in
This error indicates that you did not kill the previous bpslaunch process. @ruipeterpan
pkill bpslaunch; pkill python3
@vycezhong Thanks for your help. I double-checked to make sure no bps-related processes are alive (both inside and outside of all containers) before launching the server, yet it still crashes. Am I doing something dumb?
@ruipeterpan Did you launch any worker in your example above? The server should do nothing but wait when there is no worker.
@vycezhong If only the server gets launched, it starts ZMQ recv thread and waits w/o an error. As soon as the workers are launched, the server crashes.
OK I can reproduce it. I will look at it.
@vycezhong Could this problem be related to https://github.com/bytedance/byteps/pull/225?
My experience is that v0.2.4 worked well (e.g., see https://github.com/bytedance/byteps/issues/271).
Yes. It is because I use update_buf for pulling. I will fix it then.
@ruipeterpan Could you please test if https://github.com/bytedance/byteps/pull/359 fixes your issue?
@vycezhong thanks for the fix! The server-crashing issue is resolved by #359, but I'm seeing some weird behavior for the training loss curve after applying the changes in the PR. I'll spend some time double-checking to make sure it's not a problem on my end.
@ruipeterpan You also need to enable async for workers.
I had already toggled BYTEPS_ENABLE_ASYNC on all workers, servers, and the scheduler when switching between async and sync mode.
@ymjiang I think it should be Parameter here? AsyncParam in servers will be initialized with random values.
I do not get it. The stored buffer will be initialized with the first incoming recv, which contains the value of the parameters.
@ruipeterpan Would you test v0.2.4? My loss curve with v0.2.4 seems fine (it is at least decreasing).
The first incoming recv should be random values.
@ymjiang Here's the loss curve I got for both sync and async using v0.2.4 (809ef20)
@ruipeterpan Please try this commit. https://github.com/bytedance/byteps/pull/359/commits/7ac1dc74335b8935e4ac897e8d92d9c563fdf110
@vycezhong Here's what I got using https://github.com/bytedance/byteps/commit/7ac1dc74335b8935e4ac897e8d92d9c563fdf110 and the original scripts (bps_issue.py) I provided:
Then I commented out a metric_average() call on the training loss after each epoch (this part), and here's what I got:
Let me know if I can help with some more tests and I'll respond ASAP. Thanks for the help!
@ruipeterpan I think you may need to reduce the learning rate for async mode. I am not sure what value is appropriate, but could you try lr/N? N is the number of workers.
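Something roughly like this (a sketch, assuming the byteps.torch API; the model and base_lr are placeholders, and bps.size() is the total number of worker processes):

import torch
import byteps.torch as bps

bps.init()
model = torch.nn.Linear(10, 10)                    # placeholder model
base_lr = 0.05                                     # placeholder for the script's default lr
optimizer = torch.optim.SGD(model.parameters(),
                            lr=base_lr / bps.size())   # i.e. lr/N
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())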
@ymjiang Here's what I got using https://github.com/bytedance/byteps/commit/7ac1dc74335b8935e4ac897e8d92d9c563fdf110. The default is 0.05 and the loss curve was still going up after setting the lr to 0.0125. I also tried out some other learning rates, and in general, the larger the learning rate, the faster the loss curve goes up.
@ymjiang It is because parameter broadcasting also becomes asynchronous. The buffer is initialized with random values as shown in the figure below.
I suggest removing the copy and initializing the buffer with zeros:
memset(stored->tensor, 0, stored->len);
But this did not completely solve the problem. Parameter broadcasting should be synchronous. Right now we rely on the time delay of non-root workers to get things done right, like p.fill_(0).
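To illustrate the pattern I mean (a simplified sketch, not the actual byteps.torch implementation; the model is a placeholder): non-root workers zero their parameters before the broadcast-style push, and we rely on root's values reaching the server before those zeros are pulled back.

import torch
import byteps.torch as bps

bps.init()
model = torch.nn.Linear(10, 10)   # placeholder model

if bps.rank() != 0:
    # Non-root workers zero their parameters before the broadcast.
    with torch.no_grad():
        for p in model.parameters():
            p.fill_(0)            # corresponds to the p.fill_(0) above

bps.broadcast_parameters(model.state_dict(), root_rank=0)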
@ruipeterpan Please try this commit. https://github.com/bytedance/byteps/pull/359/commits/18699f8932e404c0a8c97f847c1c06e0b4ec1fdf
@vycezhong Here's what I got using https://github.com/bytedance/byteps/commit/18699f8932e404c0a8c97f847c1c06e0b4ec1fdf with 4 workers + 1 server. I don't know if this is related, but I should note that in the first epoch of the first run, worker 0 got a loss of ~9 while all other workers got ~2.3. In subsequent runs of the scripts this issue was gone, and the following screenshots are from those subsequent runs.
Here's what I got using https://github.com/bytedance/byteps/commit/18699f8932e404c0a8c97f847c1c06e0b4ec1fdf with 4 workers + 4 servers.
Thank you!
Hi, I encountered the same problem again in the current version of BPS.
[05:05:15] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[05:05:15] byteps/server/server.cc:430: BytePS server engine uses 8 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[05:05:15] src/postoffice.cc:25: Creating Van: zmq
[05:05:15] src/./zmq_van.h:66: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[05:05:15] src/./zmq_van.h:71: BYTEPS_ZMQ_NTHREADS set to 4
[05:05:15] src/van.cc:441: Bind to [role=server, ip=172.31.41.74, port=38517, is_recovery=0, aux_id=-1]
[05:05:15] src/./zmq_van.h:299: Start ZMQ recv thread
[05:05:29] src/van.cc:387: S[8] is connected to others
[05:05:29] 3rdparty/ps-lite/include/dmlc/logging.h:276: [05:05:29] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2268c) [0x7fa77e84d68c]
[bt] (1) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x22acd) [0x7fa77e84dacd]
[bt] (2) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fa77e846ea2]
[bt] (3) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fa77e848fc2]
[bt] (4) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2472e) [0x7fa77e84f72e]
[bt] (5) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x41491) [0x7fa77e86c491]
[bt] (6) /opt/conda/lib/libstdc++.so.6(+0xc9039) [0x7fa77e77f039]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa77ee116db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa77eb3a61f]
terminate called after throwing an instance of 'dmlc::Error'
what(): [05:05:29] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first
Stack trace returned 9 entries:
[bt] (0) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2268c) [0x7fa77e84d68c]
[bt] (1) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x22acd) [0x7fa77e84dacd]
[bt] (2) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fa77e846ea2]
[bt] (3) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fa77e848fc2]
[bt] (4) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2472e) [0x7fa77e84f72e]
[bt] (5) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x41491) [0x7fa77e86c491]
[bt] (6) /opt/conda/lib/libstdc++.so.6(+0xc9039) [0x7fa77e77f039]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa77ee116db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa77eb3a61f]
Aborted (core dumped)
Traceback (most recent call last):
File "/opt/conda/bin/bpslaunch", line 253, in <module>
launch_bps()
File "/opt/conda/bin/bpslaunch", line 249, in launch_bps
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.
I also use the example provided here. FYI, the bps setup:
Scheduler:
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch
Server
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export BYTEPS_SERVER_ENGINE_THREAD=8
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch
Worker 0
export NVIDIA_VISIBLE_DEVICES=0,1,2,3
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch python3 benchmark_byteps.py
Worker 1
export NVIDIA_VISIBLE_DEVICES=0,1,2,3
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch python3 benchmark_byteps.py
I can run without this bug in synchronous mode, and other synchronous training code also works fine. Is there a new bug related to asynchronous training? I should also mention that there is a bug in the sample code right now.
_init__.py", line 398, in broadcast_optimizer_state
76: Stopping W[9]
p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
[05:17:18] src/van.cc:104: W[9] is stopped
p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
[05:17:18] src/./zmq_van.h:81: W all threads joined and destroyed
Traceback (most recent call last):
File "/opt/conda/bin/bpslaunch", line 253, in <module>
launch_bps()
File "/opt/conda/bin/bpslaunch", line 239, in launch_bps
t[i].join()
File "/opt/conda/bin/bpslaunch", line 34, in join
raise self.exc
File "/opt/conda/bin/bpslaunch", line 27, in run
self.ret = self._target(*self._args, **self._kwargs)
File "/opt/conda/bin/bpslaunch", line 193, in worker
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
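For what it's worth, the TypeError itself reproduces in isolation; this is what the sample code's line does when an optimizer hyperparameter happens to be None:

import torch

p = None                      # a hyperparameter value that is None
p = torch.Tensor([p]).cuda()  # raises: TypeError: must be real number, not NoneType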
Thanks for your attention.
Describe the bug
Turning on asynchronous training (export BYTEPS_ENABLE_ASYNC=1) crashes the bps server (during SendPullResponse in byteps/server/server.cc).
Expected behavior
The expected behavior is for the training to run error-free, just like in synchronous training.
Stack trace from the crashed server
These are produced by turning on BYTEPS_ENABLE_GDB, setting BYTEPS_LOG_LEVEL to INFO, and setting PS_VERBOSE to 2.
To Reproduce
Steps to reproduce the behavior:
building the docker image
byteps setup
Environment (please complete the following information):
A few other things