bytedance / byteps

A high performance and generic framework for distributed DNN training
Other
3.63k stars 488 forks source link

Mxnet test case failed #190

Closed jasperzhong closed 4 years ago

jasperzhong commented 4 years ago

Describe the bug

tests/test_mxnet.py

add test test_byteps_push_pull_inplace

if __name__ == '__main__':
    mxtest = MXTest()
    bps.init()
    mxtest.test_byteps_push_pull()
    mxtest.test_byteps_trainer_param_order()
    #mxtest.test_byteps_broadcast()
    mxtest.test_byteps_push_pull_inplace() #SEE HERE 
    # TODO: fix shutdown issue https://github.com/bytedance/byteps/issues/177
    # bps.shutdown()
export BYTEPS_LOG_LEVEL=DEBUG
export BYTEPS_FORCE_DISTRIBUTED=1
export DMLC_ROLE=worker
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=1234
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=1
export NVIDIA_VISIBLE_DEVICES=0,1,2,3

And I will get this error:

Traceback (most recent call last):
    File "/usr/local/byteps/tests/test_mxnet.py", line 189, in <module>
        mxtest.test_byteps_push_pull_inplace()
    File "/usr/local/byteps/tests/test_mxnet.py", line 134, in test_byteps_push_pull_inplace
AssertionError: bps.byteps_push_pull procuces. 
     incorrect results for self

log

BytePS launching worker

!!!Enable profiling for WORKER_ID: 0 and local_rank: 0!!!
BYTEPS_TRACE_START_STEP:    BYTEPS_TRACE_END_STEP:   BYTEPS_TRACE_DIR: 
Command: python3 /usr/local/byteps/tests/test_mxnet.py

!!!Enable profiling for WORKER_ID: 0 and local_rank: 1!!!
BYTEPS_TRACE_START_STEP:    BYTEPS_TRACE_END_STEP:   BYTEPS_TRACE_DIR: 
Command: python3 /usr/local/byteps/tests/test_mxnet.py

!!!Enable profiling for WORKER_ID: 0 and local_rank: 2!!!
BYTEPS_TRACE_START_STEP:    BYTEPS_TRACE_END_STEP:   BYTEPS_TRACE_DIR: 
Command: python3 /usr/local/byteps/tests/test_mxnet.py

!!!Enable profiling for WORKER_ID: 0 and local_rank: 3!!!
BYTEPS_TRACE_START_STEP:    BYTEPS_TRACE_END_STEP:   BYTEPS_TRACE_DIR: 
Command: python3 /usr/local/byteps/tests/test_mxnet.py

[2020-01-13 06:36:23.776870: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:36:23.776992: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_3
[2020-01-13 06:36:23.777035: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_3
[2020-01-13 06:36:23.777080: D byteps/common/communicator.cc:121] This is ROOT device, rank=3, all sockets create successfully
[2020-01-13 06:36:23.777094: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:36:23.777103: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:36:23.777112: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:36:23.777128: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:36:23.777137: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:36:23.777144: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:36:23.777147: D byteps/common/communicator.cc:164] Listening on socket 3
[2020-01-13 06:36:23.777219: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl3
[2020-01-13 06:36:23.777257: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl3
[2020-01-13 06:36:23.777296: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=3, all sockets create successfully
[2020-01-13 06:36:23.777308: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:36:23.777325: D byteps/common/communicator.cc:164] Listening on socket 3
[2020-01-13 06:36:24.284394: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:36:24.284588: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_1
[2020-01-13 06:36:24.284657: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_1
[2020-01-13 06:36:24.284678: D byteps/common/communicator.cc:121] This is WORKER device, rank=1, all sockets create successfully
[2020-01-13 06:36:24.284696: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:36:24.284715: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:36:24.284731: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:36:24.284762: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:36:24.284779: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:36:24.284795: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:36:24.284864: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl1
[2020-01-13 06:36:24.284923: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl1
[2020-01-13 06:36:24.284943: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=1, all sockets create successfully
[2020-01-13 06:36:24.284958: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:36:24.367309: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:36:24.367538: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_0
[2020-01-13 06:36:24.367611: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_0
[2020-01-13 06:36:24.367649: D byteps/common/communicator.cc:121] This is WORKER device, rank=0, all sockets create successfully
[2020-01-13 06:36:24.367672: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:36:24.367692: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:36:24.367711: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:36:24.367743: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:36:24.367763: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:36:24.367780: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:36:24.367867: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl0
[2020-01-13 06:36:24.367930: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl0
[2020-01-13 06:36:24.367950: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=0, all sockets create successfully
[2020-01-13 06:36:24.367967: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:36:24.459078: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:36:24.459305: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_2
[2020-01-13 06:36:24.459383: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_2
[2020-01-13 06:36:24.459405: D byteps/common/communicator.cc:121] This is WORKER device, rank=2, all sockets create successfully
[2020-01-13 06:36:24.459424: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:36:24.459444: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:36:24.459462: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:36:24.459497: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:36:24.459516: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:36:24.459532: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:36:24.459607: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl2
[2020-01-13 06:36:24.459674: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl2
[2020-01-13 06:36:24.459694: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=2, all sockets create successfully
[2020-01-13 06:36:24.459711: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:36:27.476497: D byteps/common/nccl_manager.cc:104] root nccl_id is 1458356680124661762
[2020-01-13 06:36:27.864698: D byteps/common/nccl_manager.cc:114] recv nccl_id is 1458356680124661762, local_rank=2
[2020-01-13 06:36:27.965359: D byteps/common/nccl_manager.cc:114] recv nccl_id is 1458356680124661762, local_rank=1
[2020-01-13 06:36:28.124460: D byteps/common/nccl_manager.cc:114] recv nccl_id is 1458356680124661762, local_rank=0
[2020-01-13 06:36:28.922971: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:36:28.923055: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:36:28.923075: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:36:28.923090: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:36:28.923105: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:36:28.923119: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:36:28.923134: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:36:28.923149: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:36:28.923164: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:36:28.923178: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:36:28.923195: D byteps/common/global.cc:234] Inited rank=2 local_rank=2 size=4 local_size=4 worker_id=0
[2020-01-13 06:36:28.923744: D byteps/common/global.cc:266] Started 8 background threads. rank=2
[2020-01-13 06:36:28.927974: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:36:28.928035: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:36:28.928054: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:36:28.928070: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:36:28.928095: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:36:28.928121: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:36:28.928141: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:36:28.928118: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:36:28.928156: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:36:28.928173: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:36:28.928175: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:36:28.928188: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:36:28.928197: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:36:28.928206: D byteps/common/global.cc:234] Inited rank=3 local_rank=3 size=4 local_size=4 worker_id=0
[2020-01-13 06:36:28.928214: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:36:28.928231: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:36:28.928246: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:36:28.928261: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:36:28.928276: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:36:28.928291: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:36:28.928307: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:36:28.928324: D byteps/common/global.cc:234] Inited rank=1 local_rank=1 size=4 local_size=4 worker_id=0
[2020-01-13 06:36:28.928581: D byteps/common/global.cc:266] Started 6 background threads. rank=3
[2020-01-13 06:36:28.928745: D byteps/common/global.cc:266] Started 8 background threads. rank=1
[2020-01-13 06:36:28.930599: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:36:28.930674: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:36:28.930710: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:36:28.930728: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:36:28.930746: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:36:28.930763: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:36:28.930780: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:36:28.930797: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:36:28.930828: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:36:28.930838: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:36:28.930850: D byteps/common/global.cc:234] Inited rank=0 local_rank=0 size=4 local_size=4 worker_id=0
[2020-01-13 06:36:28.931318: D byteps/common/global.cc:266] Started 8 background threads. rank=0
[2020-01-13 06:36:29.856225: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=2
[2020-01-13 06:36:29.856402: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=2
[2020-01-13 06:36:29.991027: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=0
[2020-01-13 06:36:29.991128: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=0
[2020-01-13 06:36:30. 44476: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=1
[2020-01-13 06:36:30. 44627: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=1
[2020-01-13 06:36:30. 72022: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=3
[2020-01-13 06:36:30. 72129: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=3
[06:36:30] src/./zmq_van.h:287: Start ZMQ recv thread
[06:36:30] src/./zmq_van.h:287: Start ZMQ recv thread
[06:36:30] src/./zmq_van.h:287: Start ZMQ recv thread
[2020-01-13 06:36:30.175219: D byteps/common/global.cc:534] key 0 assigned to server 0, accumulated workload for this server is 68
[2020-01-13 06:36:30.199473: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=3
[2020-01-13 06:36:30.199546: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=3
[2020-01-13 06:36:30.199581: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=3
[2020-01-13 06:36:30.199620: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=3
[2020-01-13 06:36:30.199840: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=3
[2020-01-13 06:36:30.199865: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:36:30.200667: D byteps/common/global.cc:534] key 65536 assigned to server 0, accumulated workload for this server is 468
[2020-01-13 06:36:30.208144: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=0
[2020-01-13 06:36:30.208218: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=0
[2020-01-13 06:36:30.208251: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=0
[2020-01-13 06:36:30.208287: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=0
[2020-01-13 06:36:30.208331: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=0
[2020-01-13 06:36:30.208346: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:36:30.219204: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=1
[2020-01-13 06:36:30.219338: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=1
[2020-01-13 06:36:30.219444: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=1
[2020-01-13 06:36:30.219564: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=1
[2020-01-13 06:36:30.221256: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=1
[2020-01-13 06:36:30.221307: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:36:30.222320: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=2
[2020-01-13 06:36:30.222426: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=2
[2020-01-13 06:36:30.222478: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=2
[2020-01-13 06:36:30.222535: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=2
[2020-01-13 06:36:30.224739: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=2
[2020-01-13 06:36:30.224797: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:36:30.232399: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=0
[2020-01-13 06:36:30.232431: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:36:30.232446: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=1
[2020-01-13 06:36:30.232547: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:36:30.232724: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=2
[2020-01-13 06:36:30.232775: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:36:30.232987: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=3
[2020-01-13 06:36:30.233015: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:36:30.236672: D byteps/common/global.cc:534] key 196608 assigned to server 0, accumulated workload for this server is 868
[2020-01-13 06:36:30.245615: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=2
[2020-01-13 06:36:30.245704: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=2
[2020-01-13 06:36:30.247157: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=1
[2020-01-13 06:36:30.247306: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=1
[2020-01-13 06:36:30.248285: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=0
[2020-01-13 06:36:30.248330: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=0
[2020-01-13 06:36:30.252102: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=3
[2020-01-13 06:36:30.252221: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=3
[2020-01-13 06:36:30.253121: D byteps/common/global.cc:534] key 327680 assigned to server 0, accumulated workload for this server is 936
tensor before push_pull: 
[ 44.921947  -22.404182   19.133305   13.094497   11.725103  -69.86146
   0.1105547  33.98166    92.52082   -85.41071    16.735744   24.991835
 -77.336075   14.822626   25.756227   43.89869    58.843494 ]
<NDArray 17 @gpu(3)>
tensor after push_pull: 
[ -2.2648907 -26.942665   -3.1381292  30.620129   14.483914  -24.707336
  57.58606    17.05282     9.471867  -14.655745  -26.668665    4.1452713
  29.551579   31.442118   47.59712     5.5366697   6.2108374]
<NDArray 17 @gpu(3)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
self 201 int32 1 
[288]
<NDArray 1 @gpu(3)> 0
tensor 3 
[ 83  72   0 -36  39 -77 -22 -96 -43 -28   4  21  10 -20  75  24 -43]
<NDArray 17 @gpu(3)>
multiplied 3 
[ 332  288    0 -144  156 -308  -88 -384 -172 -112   16   84   40  -80
  300   96 -172]
<NDArray 17 @gpu(3)>
tensor before push_pull: 
[ 13.224661     0.13791323 -88.42125     25.27721     13.318825
  75.00076     74.871086   -56.566643   -52.70393     -7.9426765
 -81.11471    -57.324112    42.22908     79.71407     18.322575
   6.675613    56.61081   ]
<NDArray 17 @gpu(1)>
tensor after push_pull: 
[ -2.2648907 -26.942665   -3.1381292  30.620129   14.483914  -24.707336
  57.58606    17.05282     9.471867  -14.655745  -26.668665    4.1452713
  29.551579   31.442118   47.59712     5.5366697   6.2108374]
<NDArray 17 @gpu(1)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
self 201 int32 1 
[288]
<NDArray 1 @gpu(1)> 0
tensor 1 
[ 83  72   0 -36  39 -77 -22 -96 -43 -28   4  21  10 -20  75  24 -43]
<NDArray 17 @gpu(1)>
multiplied 1 
[ 332  288    0 -144  156 -308  -88 -384 -172 -112   16   84   40  -80
  300   96 -172]
<NDArray 17 @gpu(1)>
tensor before push_pull: 
[-88.276085    9.469676  -28.467728   32.062294  -16.11681   -62.981438
  83.053955   19.23256    15.465963   89.03272    44.934155   -7.217002
  57.025764  -39.45068    89.96457   -55.9767      7.7130795]
<NDArray 17 @gpu(0)>
tensor after push_pull: 
[ -2.2648907 -26.942665   -3.1381292  30.620129   14.483914  -24.707336
  57.58606    17.05282     9.471867  -14.655745  -26.668665    4.1452713
  29.551579   31.442118   47.59712     5.5366697   6.2108374]
<NDArray 17 @gpu(0)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
self 201 int32 1 
[288]
<NDArray 1 @gpu(0)> 0
tensor 0 
[ 83  72   0 -36  39 -77 -22 -96 -43 -28   4  21  10 -20  75  24 -43]
<NDArray 17 @gpu(0)>
multiplied 0 
[ 332  288    0 -144  156 -308  -88 -384 -172 -112   16   84   40  -80
  300   96 -172]
<NDArray 17 @gpu(0)>
tensor before push_pull: 
[ 21.069908 -94.97407   85.203156  52.046513  49.008537 -40.987206
  72.30864   71.563705 -17.39539  -54.30231  -87.22985   56.130363
  96.28754   70.68246   56.345104  27.549076 -98.324036]
<NDArray 17 @gpu(2)>
tensor after push_pull: 
[ -2.2648907 -26.942665   -3.1381292  30.620129   14.483914  -24.707336
  57.58606    17.05282     9.471867  -14.655745  -26.668665    4.1452713
  29.551579   31.442118   47.59712     5.5366697   6.2108374]
<NDArray 17 @gpu(2)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
self 201 int32 1 
[288]
<NDArray 1 @gpu(2)> 0
tensor 2 
[ 83  72   0 -36  39 -77 -22 -96 -43 -28   4  21  10 -20  75  24 -43]
<NDArray 17 @gpu(2)>
multiplied 2 
[ 332  288    0 -144  156 -308  -88 -384 -172 -112   16   84   40  -80
  300   96 -172]
<NDArray 17 @gpu(2)>
[2020-01-13 06:36:30.273508: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=0)
[2020-01-13 06:36:30.285661: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=1)
[2020-01-13 06:36:30.306601: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=2)
[2020-01-13 06:36:30.306861: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=3)
[2020-01-13 06:36:30.308988: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:36:33.253281: D byteps/common/communicator.cc:203] listen thread joined (rank=3)
[2020-01-13 06:36:33.257222: D byteps/common/communicator.cc:203] listen thread joined (rank=3)
[2020-01-13 06:36:33.257421: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=3)
[2020-01-13 06:36:33.257491: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=3)
[2020-01-13 06:36:33.257515: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:36:33.257531: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=3)
[2020-01-13 06:36:33.257544: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
Traceback (most recent call last):
  File "/usr/local/byteps/tests/test_mxnet.py", line 189, in <module>
    mxtest.test_byteps_push_pull_inplace()
  File "/usr/local/byteps/tests/test_mxnet.py", line 134, in test_byteps_push_pull_inplace
    incorrect results for self'
AssertionError: bps.byteps_push_pull produces                                                  incorrect results for self
[2020-01-13 06:36:33.262659: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:36:33.262777: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=2)
[2020-01-13 06:36:33.262825: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=2)
[2020-01-13 06:36:33.262804: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:36:33.262846: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:36:33.262866: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=2)
[2020-01-13 06:36:33.262884: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
[2020-01-13 06:36:33.262841: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:36:33.262907: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=1)
[2020-01-13 06:36:33.262962: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=0)
[2020-01-13 06:36:33.262972: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=1)
[2020-01-13 06:36:33.262995: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:36:33.263015: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=1)
[2020-01-13 06:36:33.263032: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
[2020-01-13 06:36:33.263031: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=0)
[2020-01-13 06:36:33.263061: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:36:33.263081: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=0)
[2020-01-13 06:36:33.263099: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
Traceback (most recent call last):
  File "/usr/local/byteps/tests/test_mxnet.py", line 189, in <module>
    mxtest.test_byteps_push_pull_inplace()
  File "/usr/local/byteps/tests/test_mxnet.py", line 134, in test_byteps_push_pull_inplace
    incorrect results for self'
AssertionError: bps.byteps_push_pull produces                                                  incorrect results for self
Traceback (most recent call last):
  File "/usr/local/byteps/tests/test_mxnet.py", line 189, in <module>
    mxtest.test_byteps_push_pull_inplace()
  File "/usr/local/byteps/tests/test_mxnet.py", line 134, in test_byteps_push_pull_inplace
    incorrect results for self'
AssertionError: bps.byteps_push_pull produces                                                  incorrect results for self
Traceback (most recent call last):
  File "/usr/local/byteps/tests/test_mxnet.py", line 189, in <module>
    mxtest.test_byteps_push_pull_inplace()
  File "/usr/local/byteps/tests/test_mxnet.py", line 134, in test_byteps_push_pull_inplace
    incorrect results for self'
AssertionError: bps.byteps_push_pull produces                                                  incorrect results for self
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 /usr/local/byteps/tests/test_mxnet.py' returned non-zero exit status 1.

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 /usr/local/byteps/tests/test_mxnet.py' returned non-zero exit status 1.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 /usr/local/byteps/tests/test_mxnet.py' returned non-zero exit status 1.

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 /usr/local/byteps/tests/test_mxnet.py' returned non-zero exit status 1.

To Reproduce Steps to reproduce the behavior:

  1. See error

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

use docker/Dockerfile.mxnet

Additional context Add any other context about the problem here.


IMO, this line should be removed. This is because tensors are averaged by default. (mxnet/ops.py)

jasperzhong commented 4 years ago

I change that line like this and it seems allright.

multiplied = tensor.copy()
BytePS launching worker

!!!Enable profiling for WORKER_ID: 0 and local_rank: 0!!!
BYTEPS_TRACE_START_STEP:    BYTEPS_TRACE_END_STEP:   BYTEPS_TRACE_DIR: 
Command: python3 /usr/local/byteps/tests/test_mxnet.py

!!!Enable profiling for WORKER_ID: 0 and local_rank: 1!!!
BYTEPS_TRACE_START_STEP:    BYTEPS_TRACE_END_STEP:   BYTEPS_TRACE_DIR: 
Command: python3 /usr/local/byteps/tests/test_mxnet.py

!!!Enable profiling for WORKER_ID: 0 and local_rank: 2!!!
BYTEPS_TRACE_START_STEP:    BYTEPS_TRACE_END_STEP:   BYTEPS_TRACE_DIR: 
Command: python3 /usr/local/byteps/tests/test_mxnet.py

!!!Enable profiling for WORKER_ID: 0 and local_rank: 3!!!
BYTEPS_TRACE_START_STEP:    BYTEPS_TRACE_END_STEP:   BYTEPS_TRACE_DIR: 
Command: python3 /usr/local/byteps/tests/test_mxnet.py

[2020-01-13 06:59:48.125568: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:59:48.125768: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_3
[2020-01-13 06:59:48.125820: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_3
[2020-01-13 06:59:48.125875: D byteps/common/communicator.cc:121] This is ROOT device, rank=3, all sockets create successfully
[2020-01-13 06:59:48.125890: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:59:48.125902: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:59:48.125913: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:59:48.125933: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:59:48.125943: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:59:48.125953: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:59:48.126002: D byteps/common/communicator.cc:164] Listening on socket 3
[2020-01-13 06:59:48.126034: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:59:48.126012: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl3
[2020-01-13 06:59:48.126229: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_2
[2020-01-13 06:59:48.126347: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl3
[2020-01-13 06:59:48.126363: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_2
[2020-01-13 06:59:48.126349: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:59:48.126387: D byteps/common/communicator.cc:121] This is WORKER device, rank=2, all sockets create successfully
[2020-01-13 06:59:48.126409: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:59:48.126427: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:59:48.126445: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:59:48.126451: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=3, all sockets create successfully
[2020-01-13 06:59:48.126446: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:59:48.126475: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:59:48.126473: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_1
[2020-01-13 06:59:48.126480: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:59:48.126496: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:59:48.126514: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:59:48.126556: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_0
[2020-01-13 06:59:48.126571: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_1
[2020-01-13 06:59:48.126589: D byteps/common/communicator.cc:121] This is WORKER device, rank=1, all sockets create successfully
[2020-01-13 06:59:48.126603: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:59:48.126616: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:59:48.126630: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:59:48.126601: D byteps/common/communicator.cc:164] Listening on socket 3
[2020-01-13 06:59:48.126624: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl2
[2020-01-13 06:59:48.126637: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_0
[2020-01-13 06:59:48.126666: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:59:48.126674: D byteps/common/communicator.cc:121] This is WORKER device, rank=0, all sockets create successfully
[2020-01-13 06:59:48.126682: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:59:48.126685: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:59:48.126695: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:59:48.126697: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:59:48.126708: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:59:48.126724: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:59:48.126734: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:59:48.126728: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl2
[2020-01-13 06:59:48.126742: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:59:48.126750: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=2, all sockets create successfully
[2020-01-13 06:59:48.126764: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl1
[2020-01-13 06:59:48.126767: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:59:48.126792: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl0
[2020-01-13 06:59:48.126824: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl1
[2020-01-13 06:59:48.126840: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=1, all sockets create successfully
[2020-01-13 06:59:48.126847: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl0
[2020-01-13 06:59:48.126852: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:59:48.126859: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=0, all sockets create successfully
[2020-01-13 06:59:48.126868: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:59:51.732474: D byteps/common/nccl_manager.cc:104] root nccl_id is 1458356679286587394
[2020-01-13 06:59:51.732613: D byteps/common/nccl_manager.cc:114] recv nccl_id is 1458356679286587394, local_rank=2
[2020-01-13 06:59:51.761774: D byteps/common/nccl_manager.cc:114] recv nccl_id is 1458356679286587394, local_rank=0
[2020-01-13 06:59:51.967585: D byteps/common/nccl_manager.cc:114] recv nccl_id is 1458356679286587394, local_rank=1
[2020-01-13 06:59:52.973934: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:59:52.974010: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:59:52.974029: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:59:52.974045: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:59:52.974060: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:59:52.974075: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:59:52.974089: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:59:52.974104: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:59:52.974119: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:59:52.974134: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:59:52.974150: D byteps/common/global.cc:234] Inited rank=0 local_rank=0 size=4 local_size=4 worker_id=0
[2020-01-13 06:59:52.974522: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:59:52.974593: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:59:52.974614: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:59:52.974630: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:59:52.974646: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:59:52.974660: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:59:52.974675: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:59:52.974690: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:59:52.974705: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:59:52.974720: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:59:52.974736: D byteps/common/global.cc:234] Inited rank=3 local_rank=3 size=4 local_size=4 worker_id=0
[2020-01-13 06:59:52.974773: D byteps/common/global.cc:266] Started 8 background threads. rank=0
[2020-01-13 06:59:52.974868: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:59:52.974926: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:59:52.974946: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:59:52.974962: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:59:52.974977: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:59:52.974999: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:59:52.975015: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:59:52.975030: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:59:52.975045: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:59:52.975060: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:59:52.975077: D byteps/common/global.cc:234] Inited rank=1 local_rank=1 size=4 local_size=4 worker_id=0
[2020-01-13 06:59:52.975117: D byteps/common/global.cc:266] Started 6 background threads. rank=3
[2020-01-13 06:59:52.975288: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:59:52.975347: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:59:52.975369: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:59:52.975385: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:59:52.975400: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:59:52.975415: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:59:52.975437: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:59:52.975453: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:59:52.975468: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:59:52.975483: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:59:52.975500: D byteps/common/global.cc:234] Inited rank=2 local_rank=2 size=4 local_size=4 worker_id=0
[2020-01-13 06:59:52.975634: D byteps/common/global.cc:266] Started 8 background threads. rank=1
[2020-01-13 06:59:52.976382: D byteps/common/global.cc:266] Started 8 background threads. rank=2
[2020-01-13 06:59:53.994247: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=0
[2020-01-13 06:59:53.994377: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=0
[2020-01-13 06:59:54.132045: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=1
[2020-01-13 06:59:54.132193: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=1
[2020-01-13 06:59:54.138448: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=2
[2020-01-13 06:59:54.138555: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=2
[2020-01-13 06:59:54.142282: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=3
[2020-01-13 06:59:54.142466: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=3
[06:59:54] src/./zmq_van.h:287: Start ZMQ recv thread
[06:59:54] src/./zmq_van.h:287: Start ZMQ recv thread
[06:59:54] src/./zmq_van.h:287: Start ZMQ recv thread
[2020-01-13 06:59:54.246060: D byteps/common/global.cc:534] key 0 assigned to server 0, accumulated workload for this server is 68
[2020-01-13 06:59:54.266007: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=0
[2020-01-13 06:59:54.266093: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=0
[2020-01-13 06:59:54.266134: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=0
[2020-01-13 06:59:54.266179: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=0
[2020-01-13 06:59:54.266230: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=0
[2020-01-13 06:59:54.266201: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=1
[2020-01-13 06:59:54.266251: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:59:54.266327: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=1
[2020-01-13 06:59:54.266385: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=1
[2020-01-13 06:59:54.266448: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=1
[2020-01-13 06:59:54.266718: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=2
[2020-01-13 06:59:54.266831: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=2
[2020-01-13 06:59:54.266887: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=2
[2020-01-13 06:59:54.266949: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=2
[2020-01-13 06:59:54.267141: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=1
[2020-01-13 06:59:54.267199: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:59:54.267650: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=2
[2020-01-13 06:59:54.267694: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:59:54.268190: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=3
[2020-01-13 06:59:54.268307: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=3
[2020-01-13 06:59:54.268362: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=3
[2020-01-13 06:59:54.268422: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=3
[2020-01-13 06:59:54.269580: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=3
[2020-01-13 06:59:54.269626: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:59:54.270728: D byteps/common/global.cc:534] key 65536 assigned to server 0, accumulated workload for this server is 468
[2020-01-13 06:59:54.277742: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=0
[2020-01-13 06:59:54.277778: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:59:54.277936: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=3
[2020-01-13 06:59:54.277985: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:59:54.277970: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=2
[2020-01-13 06:59:54.277978: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=1
[2020-01-13 06:59:54.278022: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:59:54.278034: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:59:54.280187: D byteps/common/global.cc:534] key 196608 assigned to server 0, accumulated workload for this server is 868
[2020-01-13 06:59:54.285305: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=0
[2020-01-13 06:59:54.285375: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=0
[2020-01-13 06:59:54.285567: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=3
[2020-01-13 06:59:54.285646: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=3
[2020-01-13 06:59:54.285834: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=2
[2020-01-13 06:59:54.285892: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=1
[2020-01-13 06:59:54.285920: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=2
[2020-01-13 06:59:54.285996: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=1
[2020-01-13 06:59:54.287747: D byteps/common/global.cc:534] key 327680 assigned to server 0, accumulated workload for this server is 936
[2020-01-13 06:59:54.294493: D byteps/common/global.cc:353] Declared tensor byteps.tensor_201, declared key (not PS key): 6 rank=3
[2020-01-13 06:59:54.294586: D byteps/common/operations.cc:247] byteps.tensor_201 partitioned to 1 part(s), total_len=1156, key_range=[393216, 393216] rank=3
[2020-01-13 06:59:54.294887: D byteps/common/global.cc:353] Declared tensor byteps.tensor_201, declared key (not PS key): 6 rank=0
[2020-01-13 06:59:54.294979: D byteps/common/operations.cc:247] byteps.tensor_201 partitioned to 1 part(s), total_len=1156, key_range=[393216, 393216] rank=0
[2020-01-13 06:59:54.295177: D byteps/common/global.cc:353] Declared tensor byteps.tensor_201, declared key (not PS key): 6 rank=2
[2020-01-13 06:59:54.295263: D byteps/common/operations.cc:247] byteps.tensor_201 partitioned to 1 part(s), total_len=1156, key_range=[393216, 393216] rank=2
[2020-01-13 06:59:54.295298: D byteps/common/global.cc:534] key 393216 assigned to server 0, accumulated workload for this server is 2092
[2020-01-13 06:59:54.295410: D byteps/common/global.cc:353] Declared tensor byteps.tensor_201, declared key (not PS key): 6 rank=1
[2020-01-13 06:59:54.295495: D byteps/common/operations.cc:247] byteps.tensor_201 partitioned to 1 part(s), total_len=1156, key_range=[393216, 393216] rank=1
[2020-01-13 06:59:54.302033: D byteps/common/global.cc:353] Declared tensor byteps.tensor_202, declared key (not PS key): 7 rank=2
[2020-01-13 06:59:54.302078: D byteps/common/operations.cc:247] byteps.tensor_202 partitioned to 1 part(s), total_len=19652, key_range=[458752, 458752] rank=2
[2020-01-13 06:59:54.302159: D byteps/common/global.cc:353] Declared tensor byteps.tensor_202, declared key (not PS key): 7 rank=1
[2020-01-13 06:59:54.302205: D byteps/common/operations.cc:247] byteps.tensor_202 partitioned to 1 part(s), total_len=19652, key_range=[458752, 458752] rank=1
[2020-01-13 06:59:54.302479: D byteps/common/global.cc:353] Declared tensor byteps.tensor_202, declared key (not PS key): 7 rank=3
[2020-01-13 06:59:54.302569: D byteps/common/operations.cc:247] byteps.tensor_202 partitioned to 1 part(s), total_len=19652, key_range=[458752, 458752] rank=3
[2020-01-13 06:59:54.302617: D byteps/common/global.cc:353] Declared tensor byteps.tensor_202, declared key (not PS key): 7 rank=0
[2020-01-13 06:59:54.302694: D byteps/common/operations.cc:247] byteps.tensor_202 partitioned to 1 part(s), total_len=19652, key_range=[458752, 458752] rank=0
[2020-01-13 06:59:54.304295: D byteps/common/global.cc:534] key 458752 assigned to server 0, accumulated workload for this server is 21744
[2020-01-13 06:59:54.309854: D byteps/common/global.cc:353] Declared tensor byteps.tensor_203, declared key (not PS key): 8 rank=0
[2020-01-13 06:59:54.309897: D byteps/common/operations.cc:247] byteps.tensor_203 partitioned to 1 part(s), total_len=136, key_range=[524288, 524288] rank=0
[2020-01-13 06:59:54.310212: D byteps/common/global.cc:353] Declared tensor byteps.tensor_203, declared key (not PS key): 8 rank=2
[2020-01-13 06:59:54.310305: D byteps/common/operations.cc:247] byteps.tensor_203 partitioned to 1 part(s), total_len=136, key_range=[524288, 524288] rank=2
[2020-01-13 06:59:54.310464: D byteps/common/global.cc:353] Declared tensor byteps.tensor_203, declared key (not PS key): 8 rank=3
[2020-01-13 06:59:54.310518: D byteps/common/operations.cc:247] byteps.tensor_203 partitioned to 1 part(s), total_len=136, key_range=[524288, 524288] rank=3
[2020-01-13 06:59:54.310532: D byteps/common/global.cc:353] Declared tensor byteps.tensor_203, declared key (not PS key): 8 rank=1
[2020-01-13 06:59:54.310606: D byteps/common/operations.cc:247] byteps.tensor_203 partitioned to 1 part(s), total_len=136, key_range=[524288, 524288] rank=1
[2020-01-13 06:59:54.311873: D byteps/common/global.cc:534] key 524288 assigned to server 0, accumulated workload for this server is 21880
[2020-01-13 06:59:54.317531: D byteps/common/global.cc:353] Declared tensor byteps.tensor_204, declared key (not PS key): 9 rank=0
[2020-01-13 06:59:54.317573: D byteps/common/operations.cc:247] byteps.tensor_204 partitioned to 1 part(s), total_len=2312, key_range=[589824, 589824] rank=0
[2020-01-13 06:59:54.317731: D byteps/common/global.cc:353] Declared tensor byteps.tensor_204, declared key (not PS key): 9 rank=3
[2020-01-13 06:59:54.317766: D byteps/common/global.cc:353] Declared tensor byteps.tensor_204, declared key (not PS key): 9 rank=1
[2020-01-13 06:59:54.317813: D byteps/common/operations.cc:247] byteps.tensor_204 partitioned to 1 part(s), total_len=2312, key_range=[589824, 589824] rank=3
[2020-01-13 06:59:54.317835: D byteps/common/operations.cc:247] byteps.tensor_204 partitioned to 1 part(s), total_len=2312, key_range=[589824, 589824] rank=1
[2020-01-13 06:59:54.318054: D byteps/common/global.cc:353] Declared tensor byteps.tensor_204, declared key (not PS key): 9 rank=2
[2020-01-13 06:59:54.318123: D byteps/common/operations.cc:247] byteps.tensor_204 partitioned to 1 part(s), total_len=2312, key_range=[589824, 589824] rank=2
[2020-01-13 06:59:54.318826: D byteps/common/global.cc:534] key 589824 assigned to server 0, accumulated workload for this server is 24192
[2020-01-13 06:59:54.324171: D byteps/common/global.cc:353] Declared tensor byteps.tensor_205, declared key (not PS key): 10 rank=0
[2020-01-13 06:59:54.324214: D byteps/common/operations.cc:247] byteps.tensor_205 partitioned to 1 part(s), total_len=39304, key_range=[655360, 655360] rank=0
[2020-01-13 06:59:54.324575: D byteps/common/global.cc:353] Declared tensor byteps.tensor_205, declared key (not PS key): 10 rank=1
[2020-01-13 06:59:54.324651: D byteps/common/operations.cc:247] byteps.tensor_205 partitioned to 1 part(s), total_len=39304, key_range=[655360, 655360] rank=1
[2020-01-13 06:59:54.324789: D byteps/common/global.cc:353] Declared tensor byteps.tensor_205, declared key (not PS key): 10 rank=2
[2020-01-13 06:59:54.324857: D byteps/common/operations.cc:247] byteps.tensor_205 partitioned to 1 part(s), total_len=39304, key_range=[655360, 655360] rank=2
[2020-01-13 06:59:54.325701: D byteps/common/global.cc:353] Declared tensor byteps.tensor_205, declared key (not PS key): 10 rank=3
[2020-01-13 06:59:54.325779: D byteps/common/operations.cc:247] byteps.tensor_205 partitioned to 1 part(s), total_len=39304, key_range=[655360, 655360] rank=3
[2020-01-13 06:59:54.328382: D byteps/common/global.cc:534] key 655360 assigned to server 0, accumulated workload for this server is 63496
[2020-01-13 06:59:54.334157: D byteps/common/global.cc:353] Declared tensor byteps.tensor_206, declared key (not PS key): 11 rank=1
[2020-01-13 06:59:54.334211: D byteps/common/operations.cc:247] byteps.tensor_206 partitioned to 1 part(s), total_len=68, key_range=[720896, 720896] rank=1
[2020-01-13 06:59:54.334442: D byteps/common/global.cc:353] Declared tensor byteps.tensor_206, declared key (not PS key): 11 rank=2
[2020-01-13 06:59:54.334460: D byteps/common/global.cc:353] Declared tensor byteps.tensor_206, declared key (not PS key): 11 rank=3
[2020-01-13 06:59:54.334509: D byteps/common/operations.cc:247] byteps.tensor_206 partitioned to 1 part(s), total_len=68, key_range=[720896, 720896] rank=3
[2020-01-13 06:59:54.334510: D byteps/common/operations.cc:247] byteps.tensor_206 partitioned to 1 part(s), total_len=68, key_range=[720896, 720896] rank=2
[2020-01-13 06:59:54.334643: D byteps/common/global.cc:353] Declared tensor byteps.tensor_206, declared key (not PS key): 11 rank=0
[2020-01-13 06:59:54.334715: D byteps/common/operations.cc:247] byteps.tensor_206 partitioned to 1 part(s), total_len=68, key_range=[720896, 720896] rank=0
[2020-01-13 06:59:54.334973: D byteps/common/global.cc:534] key 720896 assigned to server 0, accumulated workload for this server is 63564
[2020-01-13 06:59:54.340641: D byteps/common/global.cc:353] Declared tensor byteps.tensor_207, declared key (not PS key): 12 rank=3
[2020-01-13 06:59:54.340716: D byteps/common/operations.cc:247] byteps.tensor_207 partitioned to 1 part(s), total_len=1156, key_range=[786432, 786432] rank=3
[2020-01-13 06:59:54.340858: D byteps/common/global.cc:353] Declared tensor byteps.tensor_207, declared key (not PS key): 12 rank=1
[2020-01-13 06:59:54.340874: D byteps/common/global.cc:353] Declared tensor byteps.tensor_207, declared key (not PS key): 12 rank=2
[2020-01-13 06:59:54.340906: D byteps/common/operations.cc:247] byteps.tensor_207 partitioned to 1 part(s), total_len=1156, key_range=[786432, 786432] rank=1
[2020-01-13 06:59:54.340935: D byteps/common/operations.cc:247] byteps.tensor_207 partitioned to 1 part(s), total_len=1156, key_range=[786432, 786432] rank=2
[2020-01-13 06:59:54.341214: D byteps/common/global.cc:353] Declared tensor byteps.tensor_207, declared key (not PS key): 12 rank=0
[2020-01-13 06:59:54.341261: D byteps/common/operations.cc:247] byteps.tensor_207 partitioned to 1 part(s), total_len=1156, key_range=[786432, 786432] rank=0
[2020-01-13 06:59:54.341487: D byteps/common/global.cc:534] key 786432 assigned to server 0, accumulated workload for this server is 64720
[2020-01-13 06:59:54.346820: D byteps/common/global.cc:353] Declared tensor byteps.tensor_208, declared key (not PS key): 13 rank=1
[2020-01-13 06:59:54.346873: D byteps/common/operations.cc:247] byteps.tensor_208 partitioned to 1 part(s), total_len=19652, key_range=[851968, 851968] rank=1
[2020-01-13 06:59:54.346874: D byteps/common/global.cc:353] Declared tensor byteps.tensor_208, declared key (not PS key): 13 rank=3
[2020-01-13 06:59:54.346905: D byteps/common/global.cc:353] Declared tensor byteps.tensor_208, declared key (not PS key): 13 rank=0
[2020-01-13 06:59:54.346918: D byteps/common/operations.cc:247] byteps.tensor_208 partitioned to 1 part(s), total_len=19652, key_range=[851968, 851968] rank=3
[2020-01-13 06:59:54.346947: D byteps/common/operations.cc:247] byteps.tensor_208 partitioned to 1 part(s), total_len=19652, key_range=[851968, 851968] rank=0
[2020-01-13 06:59:54.346960: D byteps/common/global.cc:353] Declared tensor byteps.tensor_208, declared key (not PS key): 13 rank=2
[2020-01-13 06:59:54.347022: D byteps/common/operations.cc:247] byteps.tensor_208 partitioned to 1 part(s), total_len=19652, key_range=[851968, 851968] rank=2
[2020-01-13 06:59:54.348653: D byteps/common/global.cc:534] key 851968 assigned to server 0, accumulated workload for this server is 84372
[2020-01-13 06:59:54.353824: D byteps/common/global.cc:353] Declared tensor byteps.tensor_209, declared key (not PS key): 14 rank=3
[2020-01-13 06:59:54.353868: D byteps/common/operations.cc:247] byteps.tensor_209 partitioned to 1 part(s), total_len=136, key_range=[917504, 917504] rank=3
[2020-01-13 06:59:54.354009: D byteps/common/global.cc:353] Declared tensor byteps.tensor_209, declared key (not PS key): 14 rank=2
[2020-01-13 06:59:54.354061: D byteps/common/global.cc:353] Declared tensor byteps.tensor_209, declared key (not PS key): 14 rank=0
[2020-01-13 06:59:54.354087: D byteps/common/operations.cc:247] byteps.tensor_209 partitioned to 1 part(s), total_len=136, key_range=[917504, 917504] rank=2
[2020-01-13 06:59:54.354134: D byteps/common/operations.cc:247] byteps.tensor_209 partitioned to 1 part(s), total_len=136, key_range=[917504, 917504] rank=0
[2020-01-13 06:59:54.354217: D byteps/common/global.cc:353] Declared tensor byteps.tensor_209, declared key (not PS key): 14 rank=1
[2020-01-13 06:59:54.354250: D byteps/common/global.cc:534] key 917504 assigned to server 0, accumulated workload for this server is 84508
[2020-01-13 06:59:54.354299: D byteps/common/operations.cc:247] byteps.tensor_209 partitioned to 1 part(s), total_len=136, key_range=[917504, 917504] rank=1
[2020-01-13 06:59:54.360684: D byteps/common/global.cc:353] Declared tensor byteps.tensor_210, declared key (not PS key): 15 rank=3
[2020-01-13 06:59:54.360758: D byteps/common/operations.cc:247] byteps.tensor_210 partitioned to 1 part(s), total_len=2312, key_range=[983040, 983040] rank=3
[2020-01-13 06:59:54.360794: D byteps/common/global.cc:353] Declared tensor byteps.tensor_210, declared key (not PS key): 15 rank=0
[2020-01-13 06:59:54.360838: D byteps/common/operations.cc:247] byteps.tensor_210 partitioned to 1 part(s), total_len=2312, key_range=[983040, 983040] rank=0
[2020-01-13 06:59:54.361044: D byteps/common/global.cc:353] Declared tensor byteps.tensor_210, declared key (not PS key): 15 rank=2
[2020-01-13 06:59:54.361108: D byteps/common/operations.cc:247] byteps.tensor_210 partitioned to 1 part(s), total_len=2312, key_range=[983040, 983040] rank=2
[2020-01-13 06:59:54.361478: D byteps/common/global.cc:353] Declared tensor byteps.tensor_210, declared key (not PS key): 15 rank=1
[2020-01-13 06:59:54.361541: D byteps/common/operations.cc:247] byteps.tensor_210 partitioned to 1 part(s), total_len=2312, key_range=[983040, 983040] rank=1
[2020-01-13 06:59:54.361799: D byteps/common/global.cc:534] key 983040 assigned to server 0, accumulated workload for this server is 86820
[2020-01-13 06:59:54.366933: D byteps/common/global.cc:353] Declared tensor byteps.tensor_211, declared key (not PS key): 16 rank=0
[2020-01-13 06:59:54.366976: D byteps/common/operations.cc:247] byteps.tensor_211 partitioned to 1 part(s), total_len=39304, key_range=[1048576, 1048576] rank=0
[2020-01-13 06:59:54.367283: D byteps/common/global.cc:353] Declared tensor byteps.tensor_211, declared key (not PS key): 16 rank=2
[2020-01-13 06:59:54.367323: D byteps/common/global.cc:353] Declared tensor byteps.tensor_211, declared key (not PS key): 16 rank=1
[2020-01-13 06:59:54.367344: D byteps/common/operations.cc:247] byteps.tensor_211 partitioned to 1 part(s), total_len=39304, key_range=[1048576, 1048576] rank=2
[2020-01-13 06:59:54.367398: D byteps/common/operations.cc:247] byteps.tensor_211 partitioned to 1 part(s), total_len=39304, key_range=[1048576, 1048576] rank=1
[2020-01-13 06:59:54.367527: D byteps/common/global.cc:353] Declared tensor byteps.tensor_211, declared key (not PS key): 16 rank=3
[2020-01-13 06:59:54.367603: D byteps/common/operations.cc:247] byteps.tensor_211 partitioned to 1 part(s), total_len=39304, key_range=[1048576, 1048576] rank=3
[2020-01-13 06:59:54.370356: D byteps/common/global.cc:534] key 1048576 assigned to server 0, accumulated workload for this server is 126124
tensor before push_pull: 
[-88.276085    9.469676  -28.467728   32.062294  -16.11681   -62.981438
  83.053955   19.23256    15.465963   89.03272    44.934155   -7.217002
  57.025764  -39.45068    89.96457   -55.9767      7.7130795]
<NDArray 17 @gpu(0)>
tensor after push_pull: 
[ -2.2648907 -26.942665   -3.1381292  30.620129   14.483914  -24.707336
  57.58606    17.05282     9.471867  -14.655745  -26.668665    4.1452713
  29.551579   31.442118   47.59712     5.5366697   6.2108374]
<NDArray 17 @gpu(0)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
test_byteps_push_pull_inplace passed
tensor before push_pull: 
[ 44.921947  -22.404182   19.133305   13.094497   11.725103  -69.86146
   0.1105547  33.98166    92.52082   -85.41071    16.735744   24.991835
 -77.336075   14.822626   25.756227   43.89869    58.843494 ]
<NDArray 17 @gpu(3)>
tensor after push_pull: 
[ -2.2648907 -26.942665   -3.1381292  30.620129   14.483914  -24.707336
  57.58606    17.05282     9.471867  -14.655745  -26.668665    4.1452713
  29.551579   31.442118   47.59712     5.5366697   6.2108374]
<NDArray 17 @gpu(3)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
test_byteps_push_pull_inplace passed
[2020-01-13 06:59:54.376056: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=0)
[2020-01-13 06:59:54.376233: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=3)
tensor before push_pull: 
[ 13.224661     0.13791323 -88.42125     25.27721     13.318825
  75.00076     74.871086   -56.566643   -52.70393     -7.9426765
 -81.11471    -57.324112    42.22908     79.71407     18.322575
   6.675613    56.61081   ]
<NDArray 17 @gpu(1)>
tensor after push_pull: 
[ -2.2648907 -26.942665   -3.1381292  30.620129   14.483914  -24.707336
  57.58606    17.05282     9.471867  -14.655745  -26.668665    4.1452713
  29.551579   31.442118   47.59712     5.5366697   6.2108374]
<NDArray 17 @gpu(1)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
test_byteps_push_pull_inplace passed
tensor before push_pull: 
[ 21.069908 -94.97407   85.203156  52.046513  49.008537 -40.987206
  72.30864   71.563705 -17.39539  -54.30231  -87.22985   56.130363
  96.28754   70.68246   56.345104  27.549076 -98.324036]
<NDArray 17 @gpu(2)>
tensor after push_pull: 
[ -2.2648907 -26.942665   -3.1381292  30.620129   14.483914  -24.707336
  57.58606    17.05282     9.471867  -14.655745  -26.668665    4.1452713
  29.551579   31.442118   47.59712     5.5366697   6.2108374]
<NDArray 17 @gpu(2)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
test_byteps_push_pull_inplace passed
[2020-01-13 06:59:54.376654: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=1)
[2020-01-13 06:59:54.376689: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=2)
[2020-01-13 06:59:54.380695: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:59:57.369277: D byteps/common/communicator.cc:203] listen thread joined (rank=3)
[2020-01-13 06:59:57.373258: D byteps/common/communicator.cc:203] listen thread joined (rank=3)
[2020-01-13 06:59:57.373420: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=3)
[2020-01-13 06:59:57.373536: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=3)
[2020-01-13 06:59:57.373569: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:59:57.373589: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=3)
[2020-01-13 06:59:57.373607: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
[2020-01-13 06:59:57.377591: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:59:57.377675: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=2)
[2020-01-13 06:59:57.377725: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=2)
[2020-01-13 06:59:57.377719: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:59:57.377756: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:59:57.377773: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=2)
[2020-01-13 06:59:57.377787: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
[2020-01-13 06:59:57.377761: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:59:57.377812: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=1)
[2020-01-13 06:59:57.377854: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=1)
[2020-01-13 06:59:57.377871: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:59:57.377864: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=0)
[2020-01-13 06:59:57.377887: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=1)
[2020-01-13 06:59:57.377902: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
[2020-01-13 06:59:57.377913: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=0)
[2020-01-13 06:59:57.377934: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:59:57.377952: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=0)
[2020-01-13 06:59:57.377968: D byteps/common/operations.cc:80] BytePS has been completely shutdown now