Closed jasperzhong closed 4 years ago
I change that line like this and it seems allright.
multiplied = tensor.copy()
BytePS launching worker
!!!Enable profiling for WORKER_ID: 0 and local_rank: 0!!!
BYTEPS_TRACE_START_STEP: BYTEPS_TRACE_END_STEP: BYTEPS_TRACE_DIR:
Command: python3 /usr/local/byteps/tests/test_mxnet.py
!!!Enable profiling for WORKER_ID: 0 and local_rank: 1!!!
BYTEPS_TRACE_START_STEP: BYTEPS_TRACE_END_STEP: BYTEPS_TRACE_DIR:
Command: python3 /usr/local/byteps/tests/test_mxnet.py
!!!Enable profiling for WORKER_ID: 0 and local_rank: 2!!!
BYTEPS_TRACE_START_STEP: BYTEPS_TRACE_END_STEP: BYTEPS_TRACE_DIR:
Command: python3 /usr/local/byteps/tests/test_mxnet.py
!!!Enable profiling for WORKER_ID: 0 and local_rank: 3!!!
BYTEPS_TRACE_START_STEP: BYTEPS_TRACE_END_STEP: BYTEPS_TRACE_DIR:
Command: python3 /usr/local/byteps/tests/test_mxnet.py
[2020-01-13 06:59:48.125568: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:59:48.125768: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_3
[2020-01-13 06:59:48.125820: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_3
[2020-01-13 06:59:48.125875: D byteps/common/communicator.cc:121] This is ROOT device, rank=3, all sockets create successfully
[2020-01-13 06:59:48.125890: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:59:48.125902: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:59:48.125913: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:59:48.125933: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:59:48.125943: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:59:48.125953: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:59:48.126002: D byteps/common/communicator.cc:164] Listening on socket 3
[2020-01-13 06:59:48.126034: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:59:48.126012: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl3
[2020-01-13 06:59:48.126229: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_2
[2020-01-13 06:59:48.126347: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl3
[2020-01-13 06:59:48.126363: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_2
[2020-01-13 06:59:48.126349: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:59:48.126387: D byteps/common/communicator.cc:121] This is WORKER device, rank=2, all sockets create successfully
[2020-01-13 06:59:48.126409: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:59:48.126427: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:59:48.126445: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:59:48.126451: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=3, all sockets create successfully
[2020-01-13 06:59:48.126446: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2020-01-13 06:59:48.126475: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:59:48.126473: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_1
[2020-01-13 06:59:48.126480: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:59:48.126496: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:59:48.126514: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:59:48.126556: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_0
[2020-01-13 06:59:48.126571: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_1
[2020-01-13 06:59:48.126589: D byteps/common/communicator.cc:121] This is WORKER device, rank=1, all sockets create successfully
[2020-01-13 06:59:48.126603: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:59:48.126616: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:59:48.126630: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:59:48.126601: D byteps/common/communicator.cc:164] Listening on socket 3
[2020-01-13 06:59:48.126624: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl2
[2020-01-13 06:59:48.126637: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_0
[2020-01-13 06:59:48.126666: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:59:48.126674: D byteps/common/communicator.cc:121] This is WORKER device, rank=0, all sockets create successfully
[2020-01-13 06:59:48.126682: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:59:48.126685: D byteps/common/global.cc:124] Partition size round up to 4096000 (bytes)
[2020-01-13 06:59:48.126695: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:59:48.126697: D byteps/common/global.cc:140] Using key hash function type: djb2
[2020-01-13 06:59:48.126708: D byteps/common/global.cc:151] Number of worker=1, launching distributed job
[2020-01-13 06:59:48.126724: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2020-01-13 06:59:48.126734: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2020-01-13 06:59:48.126728: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl2
[2020-01-13 06:59:48.126742: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2020-01-13 06:59:48.126750: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=2, all sockets create successfully
[2020-01-13 06:59:48.126764: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl1
[2020-01-13 06:59:48.126767: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:59:48.126792: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl0
[2020-01-13 06:59:48.126824: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl1
[2020-01-13 06:59:48.126840: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=1, all sockets create successfully
[2020-01-13 06:59:48.126847: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl0
[2020-01-13 06:59:48.126852: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:59:48.126859: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=0, all sockets create successfully
[2020-01-13 06:59:48.126868: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2020-01-13 06:59:51.732474: D byteps/common/nccl_manager.cc:104] root nccl_id is 1458356679286587394
[2020-01-13 06:59:51.732613: D byteps/common/nccl_manager.cc:114] recv nccl_id is 1458356679286587394, local_rank=2
[2020-01-13 06:59:51.761774: D byteps/common/nccl_manager.cc:114] recv nccl_id is 1458356679286587394, local_rank=0
[2020-01-13 06:59:51.967585: D byteps/common/nccl_manager.cc:114] recv nccl_id is 1458356679286587394, local_rank=1
[2020-01-13 06:59:52.973934: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:59:52.974010: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:59:52.974029: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:59:52.974045: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:59:52.974060: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:59:52.974075: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:59:52.974089: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:59:52.974104: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:59:52.974119: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:59:52.974134: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:59:52.974150: D byteps/common/global.cc:234] Inited rank=0 local_rank=0 size=4 local_size=4 worker_id=0
[2020-01-13 06:59:52.974522: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:59:52.974593: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:59:52.974614: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:59:52.974630: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:59:52.974646: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:59:52.974660: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:59:52.974675: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:59:52.974690: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:59:52.974705: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:59:52.974720: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:59:52.974736: D byteps/common/global.cc:234] Inited rank=3 local_rank=3 size=4 local_size=4 worker_id=0
[2020-01-13 06:59:52.974773: D byteps/common/global.cc:266] Started 8 background threads. rank=0
[2020-01-13 06:59:52.974868: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:59:52.974926: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:59:52.974946: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:59:52.974962: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:59:52.974977: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:59:52.974999: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:59:52.975015: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:59:52.975030: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:59:52.975045: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:59:52.975060: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:59:52.975077: D byteps/common/global.cc:234] Inited rank=1 local_rank=1 size=4 local_size=4 worker_id=0
[2020-01-13 06:59:52.975117: D byteps/common/global.cc:266] Started 6 background threads. rank=3
[2020-01-13 06:59:52.975288: D byteps/common/global.cc:226] Create schedule queue 0
[2020-01-13 06:59:52.975347: D byteps/common/global.cc:226] Create schedule queue 1
[2020-01-13 06:59:52.975369: D byteps/common/global.cc:226] Create schedule queue 2
[2020-01-13 06:59:52.975385: D byteps/common/global.cc:226] Create schedule queue 3
[2020-01-13 06:59:52.975400: D byteps/common/global.cc:226] Create schedule queue 4
[2020-01-13 06:59:52.975415: D byteps/common/global.cc:226] Create schedule queue 5
[2020-01-13 06:59:52.975437: D byteps/common/global.cc:226] Create schedule queue 6
[2020-01-13 06:59:52.975453: D byteps/common/global.cc:226] Create schedule queue 7
[2020-01-13 06:59:52.975468: D byteps/common/global.cc:226] Create schedule queue 8
[2020-01-13 06:59:52.975483: D byteps/common/global.cc:226] Create schedule queue 9
[2020-01-13 06:59:52.975500: D byteps/common/global.cc:234] Inited rank=2 local_rank=2 size=4 local_size=4 worker_id=0
[2020-01-13 06:59:52.975634: D byteps/common/global.cc:266] Started 8 background threads. rank=1
[2020-01-13 06:59:52.976382: D byteps/common/global.cc:266] Started 8 background threads. rank=2
[2020-01-13 06:59:53.994247: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=0
[2020-01-13 06:59:53.994377: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=0
[2020-01-13 06:59:54.132045: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=1
[2020-01-13 06:59:54.132193: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=1
[2020-01-13 06:59:54.138448: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=2
[2020-01-13 06:59:54.138555: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=2
[2020-01-13 06:59:54.142282: D byteps/common/global.cc:353] Declared tensor byteps.tensor_100, declared key (not PS key): 0 rank=3
[2020-01-13 06:59:54.142466: D byteps/common/operations.cc:247] byteps.tensor_100 partitioned to 1 part(s), total_len=68, key_range=[0, 0] rank=3
[06:59:54] src/./zmq_van.h:287: Start ZMQ recv thread
[06:59:54] src/./zmq_van.h:287: Start ZMQ recv thread
[06:59:54] src/./zmq_van.h:287: Start ZMQ recv thread
[2020-01-13 06:59:54.246060: D byteps/common/global.cc:534] key 0 assigned to server 0, accumulated workload for this server is 68
[2020-01-13 06:59:54.266007: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=0
[2020-01-13 06:59:54.266093: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=0
[2020-01-13 06:59:54.266134: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=0
[2020-01-13 06:59:54.266179: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=0
[2020-01-13 06:59:54.266230: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=0
[2020-01-13 06:59:54.266201: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=1
[2020-01-13 06:59:54.266251: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:59:54.266327: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=1
[2020-01-13 06:59:54.266385: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=1
[2020-01-13 06:59:54.266448: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=1
[2020-01-13 06:59:54.266718: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=2
[2020-01-13 06:59:54.266831: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=2
[2020-01-13 06:59:54.266887: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=2
[2020-01-13 06:59:54.266949: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=2
[2020-01-13 06:59:54.267141: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=1
[2020-01-13 06:59:54.267199: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:59:54.267650: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=2
[2020-01-13 06:59:54.267694: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:59:54.268190: D byteps/common/global.cc:353] Declared tensor byteps.parameter_0, declared key (not PS key): 1 rank=3
[2020-01-13 06:59:54.268307: D byteps/common/global.cc:353] Declared tensor byteps.gradient_0, declared key (not PS key): 2 rank=3
[2020-01-13 06:59:54.268362: D byteps/common/global.cc:353] Declared tensor byteps.parameter_1, declared key (not PS key): 3 rank=3
[2020-01-13 06:59:54.268422: D byteps/common/global.cc:353] Declared tensor byteps.gradient_1, declared key (not PS key): 4 rank=3
[2020-01-13 06:59:54.269580: D byteps/common/operations.cc:247] byteps.parameter_0 partitioned to 1 part(s), total_len=400, key_range=[65536, 65536] rank=3
[2020-01-13 06:59:54.269626: D byteps/common/operations.cc:267] byteps.parameter_0 is already on cpu, len=400
[2020-01-13 06:59:54.270728: D byteps/common/global.cc:534] key 65536 assigned to server 0, accumulated workload for this server is 468
[2020-01-13 06:59:54.277742: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=0
[2020-01-13 06:59:54.277778: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:59:54.277936: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=3
[2020-01-13 06:59:54.277985: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:59:54.277970: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=2
[2020-01-13 06:59:54.277978: D byteps/common/operations.cc:247] byteps.parameter_1 partitioned to 1 part(s), total_len=400, key_range=[196608, 196608] rank=1
[2020-01-13 06:59:54.278022: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:59:54.278034: D byteps/common/operations.cc:267] byteps.parameter_1 is already on cpu, len=400
[2020-01-13 06:59:54.280187: D byteps/common/global.cc:534] key 196608 assigned to server 0, accumulated workload for this server is 868
[2020-01-13 06:59:54.285305: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=0
[2020-01-13 06:59:54.285375: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=0
[2020-01-13 06:59:54.285567: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=3
[2020-01-13 06:59:54.285646: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=3
[2020-01-13 06:59:54.285834: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=2
[2020-01-13 06:59:54.285892: D byteps/common/global.cc:353] Declared tensor byteps.tensor_200, declared key (not PS key): 5 rank=1
[2020-01-13 06:59:54.285920: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=2
[2020-01-13 06:59:54.285996: D byteps/common/operations.cc:247] byteps.tensor_200 partitioned to 1 part(s), total_len=68, key_range=[327680, 327680] rank=1
[2020-01-13 06:59:54.287747: D byteps/common/global.cc:534] key 327680 assigned to server 0, accumulated workload for this server is 936
[2020-01-13 06:59:54.294493: D byteps/common/global.cc:353] Declared tensor byteps.tensor_201, declared key (not PS key): 6 rank=3
[2020-01-13 06:59:54.294586: D byteps/common/operations.cc:247] byteps.tensor_201 partitioned to 1 part(s), total_len=1156, key_range=[393216, 393216] rank=3
[2020-01-13 06:59:54.294887: D byteps/common/global.cc:353] Declared tensor byteps.tensor_201, declared key (not PS key): 6 rank=0
[2020-01-13 06:59:54.294979: D byteps/common/operations.cc:247] byteps.tensor_201 partitioned to 1 part(s), total_len=1156, key_range=[393216, 393216] rank=0
[2020-01-13 06:59:54.295177: D byteps/common/global.cc:353] Declared tensor byteps.tensor_201, declared key (not PS key): 6 rank=2
[2020-01-13 06:59:54.295263: D byteps/common/operations.cc:247] byteps.tensor_201 partitioned to 1 part(s), total_len=1156, key_range=[393216, 393216] rank=2
[2020-01-13 06:59:54.295298: D byteps/common/global.cc:534] key 393216 assigned to server 0, accumulated workload for this server is 2092
[2020-01-13 06:59:54.295410: D byteps/common/global.cc:353] Declared tensor byteps.tensor_201, declared key (not PS key): 6 rank=1
[2020-01-13 06:59:54.295495: D byteps/common/operations.cc:247] byteps.tensor_201 partitioned to 1 part(s), total_len=1156, key_range=[393216, 393216] rank=1
[2020-01-13 06:59:54.302033: D byteps/common/global.cc:353] Declared tensor byteps.tensor_202, declared key (not PS key): 7 rank=2
[2020-01-13 06:59:54.302078: D byteps/common/operations.cc:247] byteps.tensor_202 partitioned to 1 part(s), total_len=19652, key_range=[458752, 458752] rank=2
[2020-01-13 06:59:54.302159: D byteps/common/global.cc:353] Declared tensor byteps.tensor_202, declared key (not PS key): 7 rank=1
[2020-01-13 06:59:54.302205: D byteps/common/operations.cc:247] byteps.tensor_202 partitioned to 1 part(s), total_len=19652, key_range=[458752, 458752] rank=1
[2020-01-13 06:59:54.302479: D byteps/common/global.cc:353] Declared tensor byteps.tensor_202, declared key (not PS key): 7 rank=3
[2020-01-13 06:59:54.302569: D byteps/common/operations.cc:247] byteps.tensor_202 partitioned to 1 part(s), total_len=19652, key_range=[458752, 458752] rank=3
[2020-01-13 06:59:54.302617: D byteps/common/global.cc:353] Declared tensor byteps.tensor_202, declared key (not PS key): 7 rank=0
[2020-01-13 06:59:54.302694: D byteps/common/operations.cc:247] byteps.tensor_202 partitioned to 1 part(s), total_len=19652, key_range=[458752, 458752] rank=0
[2020-01-13 06:59:54.304295: D byteps/common/global.cc:534] key 458752 assigned to server 0, accumulated workload for this server is 21744
[2020-01-13 06:59:54.309854: D byteps/common/global.cc:353] Declared tensor byteps.tensor_203, declared key (not PS key): 8 rank=0
[2020-01-13 06:59:54.309897: D byteps/common/operations.cc:247] byteps.tensor_203 partitioned to 1 part(s), total_len=136, key_range=[524288, 524288] rank=0
[2020-01-13 06:59:54.310212: D byteps/common/global.cc:353] Declared tensor byteps.tensor_203, declared key (not PS key): 8 rank=2
[2020-01-13 06:59:54.310305: D byteps/common/operations.cc:247] byteps.tensor_203 partitioned to 1 part(s), total_len=136, key_range=[524288, 524288] rank=2
[2020-01-13 06:59:54.310464: D byteps/common/global.cc:353] Declared tensor byteps.tensor_203, declared key (not PS key): 8 rank=3
[2020-01-13 06:59:54.310518: D byteps/common/operations.cc:247] byteps.tensor_203 partitioned to 1 part(s), total_len=136, key_range=[524288, 524288] rank=3
[2020-01-13 06:59:54.310532: D byteps/common/global.cc:353] Declared tensor byteps.tensor_203, declared key (not PS key): 8 rank=1
[2020-01-13 06:59:54.310606: D byteps/common/operations.cc:247] byteps.tensor_203 partitioned to 1 part(s), total_len=136, key_range=[524288, 524288] rank=1
[2020-01-13 06:59:54.311873: D byteps/common/global.cc:534] key 524288 assigned to server 0, accumulated workload for this server is 21880
[2020-01-13 06:59:54.317531: D byteps/common/global.cc:353] Declared tensor byteps.tensor_204, declared key (not PS key): 9 rank=0
[2020-01-13 06:59:54.317573: D byteps/common/operations.cc:247] byteps.tensor_204 partitioned to 1 part(s), total_len=2312, key_range=[589824, 589824] rank=0
[2020-01-13 06:59:54.317731: D byteps/common/global.cc:353] Declared tensor byteps.tensor_204, declared key (not PS key): 9 rank=3
[2020-01-13 06:59:54.317766: D byteps/common/global.cc:353] Declared tensor byteps.tensor_204, declared key (not PS key): 9 rank=1
[2020-01-13 06:59:54.317813: D byteps/common/operations.cc:247] byteps.tensor_204 partitioned to 1 part(s), total_len=2312, key_range=[589824, 589824] rank=3
[2020-01-13 06:59:54.317835: D byteps/common/operations.cc:247] byteps.tensor_204 partitioned to 1 part(s), total_len=2312, key_range=[589824, 589824] rank=1
[2020-01-13 06:59:54.318054: D byteps/common/global.cc:353] Declared tensor byteps.tensor_204, declared key (not PS key): 9 rank=2
[2020-01-13 06:59:54.318123: D byteps/common/operations.cc:247] byteps.tensor_204 partitioned to 1 part(s), total_len=2312, key_range=[589824, 589824] rank=2
[2020-01-13 06:59:54.318826: D byteps/common/global.cc:534] key 589824 assigned to server 0, accumulated workload for this server is 24192
[2020-01-13 06:59:54.324171: D byteps/common/global.cc:353] Declared tensor byteps.tensor_205, declared key (not PS key): 10 rank=0
[2020-01-13 06:59:54.324214: D byteps/common/operations.cc:247] byteps.tensor_205 partitioned to 1 part(s), total_len=39304, key_range=[655360, 655360] rank=0
[2020-01-13 06:59:54.324575: D byteps/common/global.cc:353] Declared tensor byteps.tensor_205, declared key (not PS key): 10 rank=1
[2020-01-13 06:59:54.324651: D byteps/common/operations.cc:247] byteps.tensor_205 partitioned to 1 part(s), total_len=39304, key_range=[655360, 655360] rank=1
[2020-01-13 06:59:54.324789: D byteps/common/global.cc:353] Declared tensor byteps.tensor_205, declared key (not PS key): 10 rank=2
[2020-01-13 06:59:54.324857: D byteps/common/operations.cc:247] byteps.tensor_205 partitioned to 1 part(s), total_len=39304, key_range=[655360, 655360] rank=2
[2020-01-13 06:59:54.325701: D byteps/common/global.cc:353] Declared tensor byteps.tensor_205, declared key (not PS key): 10 rank=3
[2020-01-13 06:59:54.325779: D byteps/common/operations.cc:247] byteps.tensor_205 partitioned to 1 part(s), total_len=39304, key_range=[655360, 655360] rank=3
[2020-01-13 06:59:54.328382: D byteps/common/global.cc:534] key 655360 assigned to server 0, accumulated workload for this server is 63496
[2020-01-13 06:59:54.334157: D byteps/common/global.cc:353] Declared tensor byteps.tensor_206, declared key (not PS key): 11 rank=1
[2020-01-13 06:59:54.334211: D byteps/common/operations.cc:247] byteps.tensor_206 partitioned to 1 part(s), total_len=68, key_range=[720896, 720896] rank=1
[2020-01-13 06:59:54.334442: D byteps/common/global.cc:353] Declared tensor byteps.tensor_206, declared key (not PS key): 11 rank=2
[2020-01-13 06:59:54.334460: D byteps/common/global.cc:353] Declared tensor byteps.tensor_206, declared key (not PS key): 11 rank=3
[2020-01-13 06:59:54.334509: D byteps/common/operations.cc:247] byteps.tensor_206 partitioned to 1 part(s), total_len=68, key_range=[720896, 720896] rank=3
[2020-01-13 06:59:54.334510: D byteps/common/operations.cc:247] byteps.tensor_206 partitioned to 1 part(s), total_len=68, key_range=[720896, 720896] rank=2
[2020-01-13 06:59:54.334643: D byteps/common/global.cc:353] Declared tensor byteps.tensor_206, declared key (not PS key): 11 rank=0
[2020-01-13 06:59:54.334715: D byteps/common/operations.cc:247] byteps.tensor_206 partitioned to 1 part(s), total_len=68, key_range=[720896, 720896] rank=0
[2020-01-13 06:59:54.334973: D byteps/common/global.cc:534] key 720896 assigned to server 0, accumulated workload for this server is 63564
[2020-01-13 06:59:54.340641: D byteps/common/global.cc:353] Declared tensor byteps.tensor_207, declared key (not PS key): 12 rank=3
[2020-01-13 06:59:54.340716: D byteps/common/operations.cc:247] byteps.tensor_207 partitioned to 1 part(s), total_len=1156, key_range=[786432, 786432] rank=3
[2020-01-13 06:59:54.340858: D byteps/common/global.cc:353] Declared tensor byteps.tensor_207, declared key (not PS key): 12 rank=1
[2020-01-13 06:59:54.340874: D byteps/common/global.cc:353] Declared tensor byteps.tensor_207, declared key (not PS key): 12 rank=2
[2020-01-13 06:59:54.340906: D byteps/common/operations.cc:247] byteps.tensor_207 partitioned to 1 part(s), total_len=1156, key_range=[786432, 786432] rank=1
[2020-01-13 06:59:54.340935: D byteps/common/operations.cc:247] byteps.tensor_207 partitioned to 1 part(s), total_len=1156, key_range=[786432, 786432] rank=2
[2020-01-13 06:59:54.341214: D byteps/common/global.cc:353] Declared tensor byteps.tensor_207, declared key (not PS key): 12 rank=0
[2020-01-13 06:59:54.341261: D byteps/common/operations.cc:247] byteps.tensor_207 partitioned to 1 part(s), total_len=1156, key_range=[786432, 786432] rank=0
[2020-01-13 06:59:54.341487: D byteps/common/global.cc:534] key 786432 assigned to server 0, accumulated workload for this server is 64720
[2020-01-13 06:59:54.346820: D byteps/common/global.cc:353] Declared tensor byteps.tensor_208, declared key (not PS key): 13 rank=1
[2020-01-13 06:59:54.346873: D byteps/common/operations.cc:247] byteps.tensor_208 partitioned to 1 part(s), total_len=19652, key_range=[851968, 851968] rank=1
[2020-01-13 06:59:54.346874: D byteps/common/global.cc:353] Declared tensor byteps.tensor_208, declared key (not PS key): 13 rank=3
[2020-01-13 06:59:54.346905: D byteps/common/global.cc:353] Declared tensor byteps.tensor_208, declared key (not PS key): 13 rank=0
[2020-01-13 06:59:54.346918: D byteps/common/operations.cc:247] byteps.tensor_208 partitioned to 1 part(s), total_len=19652, key_range=[851968, 851968] rank=3
[2020-01-13 06:59:54.346947: D byteps/common/operations.cc:247] byteps.tensor_208 partitioned to 1 part(s), total_len=19652, key_range=[851968, 851968] rank=0
[2020-01-13 06:59:54.346960: D byteps/common/global.cc:353] Declared tensor byteps.tensor_208, declared key (not PS key): 13 rank=2
[2020-01-13 06:59:54.347022: D byteps/common/operations.cc:247] byteps.tensor_208 partitioned to 1 part(s), total_len=19652, key_range=[851968, 851968] rank=2
[2020-01-13 06:59:54.348653: D byteps/common/global.cc:534] key 851968 assigned to server 0, accumulated workload for this server is 84372
[2020-01-13 06:59:54.353824: D byteps/common/global.cc:353] Declared tensor byteps.tensor_209, declared key (not PS key): 14 rank=3
[2020-01-13 06:59:54.353868: D byteps/common/operations.cc:247] byteps.tensor_209 partitioned to 1 part(s), total_len=136, key_range=[917504, 917504] rank=3
[2020-01-13 06:59:54.354009: D byteps/common/global.cc:353] Declared tensor byteps.tensor_209, declared key (not PS key): 14 rank=2
[2020-01-13 06:59:54.354061: D byteps/common/global.cc:353] Declared tensor byteps.tensor_209, declared key (not PS key): 14 rank=0
[2020-01-13 06:59:54.354087: D byteps/common/operations.cc:247] byteps.tensor_209 partitioned to 1 part(s), total_len=136, key_range=[917504, 917504] rank=2
[2020-01-13 06:59:54.354134: D byteps/common/operations.cc:247] byteps.tensor_209 partitioned to 1 part(s), total_len=136, key_range=[917504, 917504] rank=0
[2020-01-13 06:59:54.354217: D byteps/common/global.cc:353] Declared tensor byteps.tensor_209, declared key (not PS key): 14 rank=1
[2020-01-13 06:59:54.354250: D byteps/common/global.cc:534] key 917504 assigned to server 0, accumulated workload for this server is 84508
[2020-01-13 06:59:54.354299: D byteps/common/operations.cc:247] byteps.tensor_209 partitioned to 1 part(s), total_len=136, key_range=[917504, 917504] rank=1
[2020-01-13 06:59:54.360684: D byteps/common/global.cc:353] Declared tensor byteps.tensor_210, declared key (not PS key): 15 rank=3
[2020-01-13 06:59:54.360758: D byteps/common/operations.cc:247] byteps.tensor_210 partitioned to 1 part(s), total_len=2312, key_range=[983040, 983040] rank=3
[2020-01-13 06:59:54.360794: D byteps/common/global.cc:353] Declared tensor byteps.tensor_210, declared key (not PS key): 15 rank=0
[2020-01-13 06:59:54.360838: D byteps/common/operations.cc:247] byteps.tensor_210 partitioned to 1 part(s), total_len=2312, key_range=[983040, 983040] rank=0
[2020-01-13 06:59:54.361044: D byteps/common/global.cc:353] Declared tensor byteps.tensor_210, declared key (not PS key): 15 rank=2
[2020-01-13 06:59:54.361108: D byteps/common/operations.cc:247] byteps.tensor_210 partitioned to 1 part(s), total_len=2312, key_range=[983040, 983040] rank=2
[2020-01-13 06:59:54.361478: D byteps/common/global.cc:353] Declared tensor byteps.tensor_210, declared key (not PS key): 15 rank=1
[2020-01-13 06:59:54.361541: D byteps/common/operations.cc:247] byteps.tensor_210 partitioned to 1 part(s), total_len=2312, key_range=[983040, 983040] rank=1
[2020-01-13 06:59:54.361799: D byteps/common/global.cc:534] key 983040 assigned to server 0, accumulated workload for this server is 86820
[2020-01-13 06:59:54.366933: D byteps/common/global.cc:353] Declared tensor byteps.tensor_211, declared key (not PS key): 16 rank=0
[2020-01-13 06:59:54.366976: D byteps/common/operations.cc:247] byteps.tensor_211 partitioned to 1 part(s), total_len=39304, key_range=[1048576, 1048576] rank=0
[2020-01-13 06:59:54.367283: D byteps/common/global.cc:353] Declared tensor byteps.tensor_211, declared key (not PS key): 16 rank=2
[2020-01-13 06:59:54.367323: D byteps/common/global.cc:353] Declared tensor byteps.tensor_211, declared key (not PS key): 16 rank=1
[2020-01-13 06:59:54.367344: D byteps/common/operations.cc:247] byteps.tensor_211 partitioned to 1 part(s), total_len=39304, key_range=[1048576, 1048576] rank=2
[2020-01-13 06:59:54.367398: D byteps/common/operations.cc:247] byteps.tensor_211 partitioned to 1 part(s), total_len=39304, key_range=[1048576, 1048576] rank=1
[2020-01-13 06:59:54.367527: D byteps/common/global.cc:353] Declared tensor byteps.tensor_211, declared key (not PS key): 16 rank=3
[2020-01-13 06:59:54.367603: D byteps/common/operations.cc:247] byteps.tensor_211 partitioned to 1 part(s), total_len=39304, key_range=[1048576, 1048576] rank=3
[2020-01-13 06:59:54.370356: D byteps/common/global.cc:534] key 1048576 assigned to server 0, accumulated workload for this server is 126124
tensor before push_pull:
[-88.276085 9.469676 -28.467728 32.062294 -16.11681 -62.981438
83.053955 19.23256 15.465963 89.03272 44.934155 -7.217002
57.025764 -39.45068 89.96457 -55.9767 7.7130795]
<NDArray 17 @gpu(0)>
tensor after push_pull:
[ -2.2648907 -26.942665 -3.1381292 30.620129 14.483914 -24.707336
57.58606 17.05282 9.471867 -14.655745 -26.668665 4.1452713
29.551579 31.442118 47.59712 5.5366697 6.2108374]
<NDArray 17 @gpu(0)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
test_byteps_push_pull_inplace passed
tensor before push_pull:
[ 44.921947 -22.404182 19.133305 13.094497 11.725103 -69.86146
0.1105547 33.98166 92.52082 -85.41071 16.735744 24.991835
-77.336075 14.822626 25.756227 43.89869 58.843494 ]
<NDArray 17 @gpu(3)>
tensor after push_pull:
[ -2.2648907 -26.942665 -3.1381292 30.620129 14.483914 -24.707336
57.58606 17.05282 9.471867 -14.655745 -26.668665 4.1452713
29.551579 31.442118 47.59712 5.5366697 6.2108374]
<NDArray 17 @gpu(3)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
test_byteps_push_pull_inplace passed
[2020-01-13 06:59:54.376056: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=0)
[2020-01-13 06:59:54.376233: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=3)
tensor before push_pull:
[ 13.224661 0.13791323 -88.42125 25.27721 13.318825
75.00076 74.871086 -56.566643 -52.70393 -7.9426765
-81.11471 -57.324112 42.22908 79.71407 18.322575
6.675613 56.61081 ]
<NDArray 17 @gpu(1)>
tensor after push_pull:
[ -2.2648907 -26.942665 -3.1381292 30.620129 14.483914 -24.707336
57.58606 17.05282 9.471867 -14.655745 -26.668665 4.1452713
29.551579 31.442118 47.59712 5.5366697 6.2108374]
<NDArray 17 @gpu(1)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
test_byteps_push_pull_inplace passed
tensor before push_pull:
[ 21.069908 -94.97407 85.203156 52.046513 49.008537 -40.987206
72.30864 71.563705 -17.39539 -54.30231 -87.22985 56.130363
96.28754 70.68246 56.345104 27.549076 -98.324036]
<NDArray 17 @gpu(2)>
tensor after push_pull:
[ -2.2648907 -26.942665 -3.1381292 30.620129 14.483914 -24.707336
57.58606 17.05282 9.471867 -14.655745 -26.668665 4.1452713
29.551579 31.442118 47.59712 5.5366697 6.2108374]
<NDArray 17 @gpu(2)>
test_byteps_push_pull passed
test_byteps_trainer_param_order passed
test_byteps_push_pull_inplace passed
[2020-01-13 06:59:54.376654: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=1)
[2020-01-13 06:59:54.376689: D byteps/common/global.cc:282] Shutdown BytePS: start to clean the resources (rank=2)
[2020-01-13 06:59:54.380695: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:59:57.369277: D byteps/common/communicator.cc:203] listen thread joined (rank=3)
[2020-01-13 06:59:57.373258: D byteps/common/communicator.cc:203] listen thread joined (rank=3)
[2020-01-13 06:59:57.373420: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=3)
[2020-01-13 06:59:57.373536: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=3)
[2020-01-13 06:59:57.373569: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:59:57.373589: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=3)
[2020-01-13 06:59:57.373607: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
[2020-01-13 06:59:57.377591: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:59:57.377675: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=2)
[2020-01-13 06:59:57.377725: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=2)
[2020-01-13 06:59:57.377719: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:59:57.377756: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:59:57.377773: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=2)
[2020-01-13 06:59:57.377787: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
[2020-01-13 06:59:57.377761: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2020-01-13 06:59:57.377812: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=1)
[2020-01-13 06:59:57.377854: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=1)
[2020-01-13 06:59:57.377871: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:59:57.377864: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=0)
[2020-01-13 06:59:57.377887: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=1)
[2020-01-13 06:59:57.377902: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
[2020-01-13 06:59:57.377913: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=0)
[2020-01-13 06:59:57.377934: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2020-01-13 06:59:57.377952: D byteps/common/global.cc:335] Shutdown BytePS: all BytePS resources has been cleaned (rank=0)
[2020-01-13 06:59:57.377968: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
Describe the bug
tests/test_mxnet.py
add test
test_byteps_push_pull_inplace
And I will get this error:
log
To Reproduce Steps to reproduce the behavior:
Expected behavior A clear and concise description of what you expected to happen.
Screenshots If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
use docker/Dockerfile.mxnet
Additional context Add any other context about the problem here.
IMO, this line should be removed. This is because tensors are averaged by default. (mxnet/ops.py)