Open nowei opened 5 years ago
Hello @nowei , would you confirm that you can run EVAL_TYPE=benchmark with multiple GPUs? If so, we can narrow down the problem to be in train_mnist_byteps.py
Yeah, it's running it in the first screenshot for ten iterations, so yeah, it can run EVAL_TYPE=benchmark with multiple GPUs.
Would you set NCCL_DEBUG=INFO and run again? You may also set BYTEPS_LOG_LEVEL=INFO or even BYTEPS_LOG_LEVEL=TRACE. Then paste us the logs (it may be very long if you set BYTEPS_LOG_LEVEL=TRACE). Thanks.
Yeah, sorry for the late reply. It takes a while to set up each time and I was a bit busy over the last few days. Here's the log with NCCL_DEBUG=INFO
BytePS launching worker
training mnist...
training mnist...
training mnist...
training mnist...
ip-172-31-73-142:618:618 [3] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
ip-172-31-73-142:618:618 [3] NCCL INFO Using internal Network Socket
ip-172-31-73-142:618:618 [3] NCCL INFO NET : Using interface ens3:172.31.73.142<0>
ip-172-31-73-142:618:618 [3] NCCL INFO NET/Socket : 1 interfaces found
ip-172-31-73-142:618:618 [3] NCCL INFO rank 3 nranks 4
ip-172-31-73-142:614:614 [1] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
ip-172-31-73-142:614:614 [1] NCCL INFO Using internal Network Socket
ip-172-31-73-142:614:614 [1] NCCL INFO rank 1 nranks 4
ip-172-31-73-142:617:617 [2] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
ip-172-31-73-142:612:612 [0] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
ip-172-31-73-142:617:617 [2] NCCL INFO Using internal Network Socket
ip-172-31-73-142:612:612 [0] NCCL INFO Using internal Network Socket
ip-172-31-73-142:617:617 [2] NCCL INFO rank 2 nranks 4
NCCL version 2.3.7+cuda9.0
ip-172-31-73-142:612:612 [0] NCCL INFO rank 0 nranks 4
ip-172-31-73-142:614:614 [1] NCCL INFO comm 0x4c601000 rank 1 nranks 4
ip-172-31-73-142:614:614 [1] NCCL INFO NET : Using interface ens3:172.31.73.142<0>
ip-172-31-73-142:614:614 [1] NCCL INFO NET/Socket : 1 interfaces found
ip-172-31-73-142:612:612 [0] NCCL INFO comm 0x49e9f4f0 rank 0 nranks 4
ip-172-31-73-142:612:612 [0] NCCL INFO NET : Using interface ens3:172.31.73.142<0>
ip-172-31-73-142:612:612 [0] NCCL INFO NET/Socket : 1 interfaces found
ip-172-31-73-142:617:617 [2] NCCL INFO comm 0x4c19dd40 rank 2 nranks 4
ip-172-31-73-142:617:617 [2] NCCL INFO NET : Using interface ens3:172.31.73.142<0>
ip-172-31-73-142:617:617 [2] NCCL INFO NET/Socket : 1 interfaces found
ip-172-31-73-142:618:618 [3] NCCL INFO comm 0x4b4a45a0 rank 3 nranks 4
ip-172-31-73-142:614:614 [1] NCCL INFO CUDA Dev 1, IP Interfaces : ens3(PHB)
ip-172-31-73-142:617:617 [2] NCCL INFO CUDA Dev 2, IP Interfaces : ens3(PHB)
ip-172-31-73-142:618:618 [3] NCCL INFO CUDA Dev 3, IP Interfaces : ens3(PHB)
ip-172-31-73-142:612:612 [0] NCCL INFO CUDA Dev 0, IP Interfaces : ens3(PHB)
ip-172-31-73-142:612:612 [0] NCCL INFO Using 128 threads
ip-172-31-73-142:612:612 [0] NCCL INFO Min Comp Cap 3
ip-172-31-73-142:612:612 [0] NCCL INFO Ring 00 : 0 1 2 3
ip-172-31-73-142:612:612 [0] NCCL INFO Ring 01 : 0 1 2 3
ip-172-31-73-142:614:614 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
ip-172-31-73-142:617:617 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
ip-172-31-73-142:618:618 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via P2P/IPC
ip-172-31-73-142:612:612 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
ip-172-31-73-142:612:612 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
ip-172-31-73-142:614:614 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
ip-172-31-73-142:617:617 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
ip-172-31-73-142:618:618 [3] NCCL INFO Ring 01 : 3[3] -> 0[0] via P2P/IPC
ip-172-31-73-142:617:617 [2] NCCL INFO comm 0x4c19dd40 rank 2 nranks 4 - COMPLETE
ip-172-31-73-142:612:612 [0] NCCL INFO comm 0x49e9f4f0 rank 0 nranks 4 - COMPLETE
ip-172-31-73-142:614:614 [1] NCCL INFO comm 0x4c601000 rank 1 nranks 4 - COMPLETE
ip-172-31-73-142:618:618 [3] NCCL INFO comm 0x4b4a45a0 rank 3 nranks 4 - COMPLETE
ip-172-31-73-142:612:771 [0] NCCL INFO Launch mode Parallel
[2019-12-01 06:58:34.178376: F byteps/common/nccl_manager.cc:37] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20: 618 Aborted (core dumped) python $path/train_mnist_byteps.py $@
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/byteps/launcher/launch.py", line 41, in worker
subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh' returned non-zero exit status 134
When I ran it with BYTEPS_LOG_LEVEL=INFO, it didn't give anything new. And then I'll make a separate post with the other one since it's pretty long.
Here's the log with both NCCL_DEBUG=INFO and BYTEPS_LOG_LEVEL=TRACE:
BytePS launching worker
training mnist...
training mnist...
training mnist...
training mnist...
[2019-12-01 06:56:10.396131: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-12-01 06:56:10.396279: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_2
[2019-12-01 06:56:10.396346: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_2
[2019-12-01 06:56:10.396366: D byteps/common/communicator.cc:121] This is WORKER device, rank=2, all sockets create successfully
[2019-12-01 06:56:10.396388: D byteps/common/global.cc:102] Partition bound set to 4096000 bytes, aligned to 4096000 bytes
[2019-12-01 06:56:10.396405: D byteps/common/global.cc:134] Number of worker=1, launching non-distributed job
[2019-12-01 06:56:10.396431: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2019-12-01 06:56:10.396450: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2019-12-01 06:56:10.396465: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2019-12-01 06:56:10.396483: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-12-01 06:56:10.396528: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl2
[2019-12-01 06:56:10.396594: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl2
[2019-12-01 06:56:10.396613: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=2, all sockets create successfully
[2019-12-01 06:56:10.396621: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2019-12-01 06:56:10.396620: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_0
[2019-12-01 06:56:10.396700: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_0
[2019-12-01 06:56:10.396721: D byteps/common/communicator.cc:121] This is WORKER device, rank=0, all sockets create successfully
[2019-12-01 06:56:10.396741: D byteps/common/global.cc:102] Partition bound set to 4096000 bytes, aligned to 4096000 bytes
[2019-12-01 06:56:10.396752: D byteps/common/global.cc:134] Number of worker=1, launching non-distributed job
[2019-12-01 06:56:10.396779: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2019-12-01 06:56:10.396796: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2019-12-01 06:56:10.396807: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2019-12-01 06:56:10.396840: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-12-01 06:56:10.396883: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl0
[2019-12-01 06:56:10.396958: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_1
[2019-12-01 06:56:10.396970: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl0
[2019-12-01 06:56:10.396985: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=0, all sockets create successfully
[2019-12-01 06:56:10.396997: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2019-12-01 06:56:10.397032: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_1
[2019-12-01 06:56:10.397046: D byteps/common/communicator.cc:121] This is WORKER device, rank=1, all sockets create successfully
[2019-12-01 06:56:10.397059: D byteps/common/global.cc:102] Partition bound set to 4096000 bytes, aligned to 4096000 bytes
[2019-12-01 06:56:10.397069: D byteps/common/global.cc:134] Number of worker=1, launching non-distributed job
[2019-12-01 06:56:10.397086: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2019-12-01 06:56:10.397106: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2019-12-01 06:56:10.397114: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2019-12-01 06:56:10.397185: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl1
[2019-12-01 06:56:10.397188: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-12-01 06:56:10.397246: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl1
[2019-12-01 06:56:10.397259: D byteps/common/communicator.cc:55] This is nccl WORKER device, rank=1, all sockets create successfully
[2019-12-01 06:56:10.397275: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2019-12-01 06:56:10.397324: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_3
[2019-12-01 06:56:10.397413: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_3
[2019-12-01 06:56:10.397477: D byteps/common/communicator.cc:121] This is ROOT device, rank=3, all sockets create successfully
[2019-12-01 06:56:10.397505: D byteps/common/global.cc:102] Partition bound set to 4096000 bytes, aligned to 4096000 bytes
[2019-12-01 06:56:10.397524: D byteps/common/global.cc:134] Number of worker=1, launching non-distributed job
[2019-12-01 06:56:10.397532: D byteps/common/communicator.cc:164] Listening on socket 3
[2019-12-01 06:56:10.397552: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2019-12-01 06:56:10.397584: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 4
[2019-12-01 06:56:10.397609: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2019-12-01 06:56:10.397702: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl3
[2019-12-01 06:56:10.397796: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl3
[2019-12-01 06:56:10.397888: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=3, all sockets create successfully
[2019-12-01 06:56:10.397915: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0 1 2 3
[2019-12-01 06:56:10.397931: D byteps/common/communicator.cc:164] Listening on socket 3
ip-172-31-73-142:402:402 [3] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
ip-172-31-73-142:402:402 [3] NCCL INFO Using internal Network Socket
ip-172-31-73-142:402:402 [3] NCCL INFO NET : Using interface ens3:172.31.73.142<0>
ip-172-31-73-142:402:402 [3] NCCL INFO NET/Socket : 1 interfaces found
[2019-12-01 06:56:13.400565: D byteps/common/nccl_manager.cc:104] root nccl_id is -8193983221049262078
ip-172-31-73-142:402:402 [3] NCCL INFO rank 3 nranks 4
[2019-12-01 06:56:13.400666: T byteps/common/communicator.cc:253] non-root socket recved: src=-1915158526, signal=-1907810388, key=0, myrank=0
[2019-12-01 06:56:13.400703: D byteps/common/nccl_manager.cc:114] recv nccl_id is -8193983221049262078, local_rank=0
[2019-12-01 06:56:13.400690: T byteps/common/communicator.cc:253] non-root socket recved: src=-1915158526, signal=-1907810388, key=0, myrank=1
[2019-12-01 06:56:13.400715: T byteps/common/communicator.cc:253] non-root socket recved: src=-1915158526, signal=-1907810388, key=0, myrank=2
[2019-12-01 06:56:13.400731: D byteps/common/nccl_manager.cc:114] recv nccl_id is -8193983221049262078, local_rank=1
[2019-12-01 06:56:13.400758: D byteps/common/nccl_manager.cc:114] recv nccl_id is -8193983221049262078, local_rank=2
ip-172-31-73-142:403:403 [2] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
ip-172-31-73-142:399:399 [1] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
ip-172-31-73-142:395:395 [0] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
ip-172-31-73-142:403:403 [2] NCCL INFO Using internal Network Socket
ip-172-31-73-142:399:399 [1] NCCL INFO Using internal Network Socket
ip-172-31-73-142:403:403 [2] NCCL INFO rank 2 nranks 4
ip-172-31-73-142:395:395 [0] NCCL INFO Using internal Network Socket
ip-172-31-73-142:399:399 [1] NCCL INFO rank 1 nranks 4
NCCL version 2.3.7+cuda9.0
ip-172-31-73-142:395:395 [0] NCCL INFO rank 0 nranks 4
ip-172-31-73-142:403:403 [2] NCCL INFO comm 0x4bc18eb0 rank 2 nranks 4
ip-172-31-73-142:403:403 [2] NCCL INFO NET : Using interface ens3:172.31.73.142<0>
ip-172-31-73-142:403:403 [2] NCCL INFO NET/Socket : 1 interfaces found
ip-172-31-73-142:395:395 [0] NCCL INFO comm 0x4cf74be0 rank 0 nranks 4
ip-172-31-73-142:395:395 [0] NCCL INFO NET : Using interface ens3:172.31.73.142<0>
ip-172-31-73-142:399:399 [1] NCCL INFO comm 0x4c48f820 rank 1 nranks 4
ip-172-31-73-142:395:395 [0] NCCL INFO NET/Socket : 1 interfaces found
ip-172-31-73-142:399:399 [1] NCCL INFO NET : Using interface ens3:172.31.73.142<0>
ip-172-31-73-142:399:399 [1] NCCL INFO NET/Socket : 1 interfaces found
ip-172-31-73-142:402:402 [3] NCCL INFO comm 0x4cad0600 rank 3 nranks 4
ip-172-31-73-142:399:399 [1] NCCL INFO CUDA Dev 1, IP Interfaces : ens3(PHB)
ip-172-31-73-142:403:403 [2] NCCL INFO CUDA Dev 2, IP Interfaces : ens3(PHB)
ip-172-31-73-142:402:402 [3] NCCL INFO CUDA Dev 3, IP Interfaces : ens3(PHB)
ip-172-31-73-142:395:395 [0] NCCL INFO CUDA Dev 0, IP Interfaces : ens3(PHB)
ip-172-31-73-142:395:395 [0] NCCL INFO Using 128 threads
ip-172-31-73-142:395:395 [0] NCCL INFO Min Comp Cap 3
ip-172-31-73-142:395:395 [0] NCCL INFO Ring 00 : 0 1 2 3
ip-172-31-73-142:395:395 [0] NCCL INFO Ring 01 : 0 1 2 3
ip-172-31-73-142:399:399 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
ip-172-31-73-142:402:402 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via P2P/IPC
ip-172-31-73-142:403:403 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
ip-172-31-73-142:395:395 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
ip-172-31-73-142:395:395 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
ip-172-31-73-142:399:399 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
ip-172-31-73-142:403:403 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
ip-172-31-73-142:402:402 [3] NCCL INFO Ring 01 : 3[3] -> 0[0] via P2P/IPC
ip-172-31-73-142:399:399 [1] NCCL INFO comm 0x4c48f820 rank 1 nranks 4 - COMPLETE
ip-172-31-73-142:395:395 [0] NCCL INFO comm 0x4cf74be0 rank 0 nranks 4 - COMPLETE
ip-172-31-73-142:402:402 [3] NCCL INFO comm 0x4cad0600 rank 3 nranks 4 - COMPLETE
ip-172-31-73-142:403:403 [2] NCCL INFO comm 0x4bc18eb0 rank 2 nranks 4 - COMPLETE
[2019-12-01 06:56:13.607995: D byteps/common/global.cc:193] Create schedule queue 0
[2019-12-01 06:56:13.608031: D byteps/common/global.cc:193] Create schedule queue 1
[2019-12-01 06:56:13.608042: D byteps/common/global.cc:193] Create schedule queue 2
[2019-12-01 06:56:13.608054: D byteps/common/global.cc:193] Create schedule queue 3
[2019-12-01 06:56:13.608070: D byteps/common/global.cc:193] Create schedule queue 4
[2019-12-01 06:56:13.608079: D byteps/common/global.cc:193] Create schedule queue 5
[2019-12-01 06:56:13.608088: D byteps/common/global.cc:193] Create schedule queue 6
[2019-12-01 06:56:13.608101: D byteps/common/global.cc:193] Create schedule queue 7
[2019-12-01 06:56:13.608102: D byteps/common/global.cc:193] Create schedule queue 0
[2019-12-01 06:56:13.608115: D byteps/common/global.cc:193] Create schedule queue 8
[2019-12-01 06:56:13.608125: D byteps/common/global.cc:193] Create schedule queue 9
[2019-12-01 06:56:13.608126: D byteps/common/global.cc:193] Create schedule queue 1
[2019-12-01 06:56:13.608141: D byteps/common/global.cc:193] Create schedule queue 2
[2019-12-01 06:56:13.608141: D byteps/common/global.cc:201] Inited rank=1 local_rank=1 size=4 local_size=4 worker_id=0
[2019-12-01 06:56:13.608150: D byteps/common/global.cc:193] Create schedule queue 3
[2019-12-01 06:56:13.608160: D byteps/common/global.cc:193] Create schedule queue 4
[2019-12-01 06:56:13.608167: D byteps/common/global.cc:193] Create schedule queue 5
[2019-12-01 06:56:13.608176: D byteps/common/global.cc:193] Create schedule queue 6
[2019-12-01 06:56:13.608194: D byteps/common/global.cc:193] Create schedule queue 7
[2019-12-01 06:56:13.608203: D byteps/common/global.cc:193] Create schedule queue 8
[2019-12-01 06:56:13.608210: D byteps/common/global.cc:193] Create schedule queue 9
[2019-12-01 06:56:13.608226: D byteps/common/global.cc:201] Inited rank=0 local_rank=0 size=4 local_size=4 worker_id=0
[2019-12-01 06:56:13.608244: D byteps/common/global.cc:193] Create schedule queue 0
[2019-12-01 06:56:13.608277: D byteps/common/global.cc:193] Create schedule queue 1
[2019-12-01 06:56:13.608308: D byteps/common/global.cc:193] Create schedule queue 2
[2019-12-01 06:56:13.608323: D byteps/common/global.cc:193] Create schedule queue 3
[2019-12-01 06:56:13.608337: D byteps/common/global.cc:193] Create schedule queue 4
[2019-12-01 06:56:13.608344: D byteps/common/global.cc:193] Create schedule queue 5
[2019-12-01 06:56:13.608357: D byteps/common/global.cc:193] Create schedule queue 6
[2019-12-01 06:56:13.608359: D byteps/common/global.cc:193] Create schedule queue 0
[2019-12-01 06:56:13.608371: D byteps/common/global.cc:193] Create schedule queue 7
[2019-12-01 06:56:13.608385: D byteps/common/global.cc:193] Create schedule queue 8
[2019-12-01 06:56:13.608387: D byteps/common/global.cc:193] Create schedule queue 1
[2019-12-01 06:56:13.608397: D byteps/common/global.cc:193] Create schedule queue 9
[2019-12-01 06:56:13.608408: D byteps/common/global.cc:193] Create schedule queue 2
[2019-12-01 06:56:13.608413: D byteps/common/global.cc:201] Inited rank=3 local_rank=3 size=4 local_size=4 worker_id=0
[2019-12-01 06:56:13.608420: D byteps/common/global.cc:233] Started 4 background threads. rank=1
[2019-12-01 06:56:13.608421: D byteps/common/global.cc:233] Started 4 background threads. rank=0
[2019-12-01 06:56:13.608424: D byteps/common/global.cc:193] Create schedule queue 3
[2019-12-01 06:56:13.608440: D byteps/common/global.cc:193] Create schedule queue 4
[2019-12-01 06:56:13.608454: D byteps/common/global.cc:193] Create schedule queue 5
[2019-12-01 06:56:13.608463: D byteps/common/global.cc:193] Create schedule queue 6
[2019-12-01 06:56:13.608477: D byteps/common/global.cc:193] Create schedule queue 7
[2019-12-01 06:56:13.608491: D byteps/common/global.cc:193] Create schedule queue 8
[2019-12-01 06:56:13.608505: D byteps/common/global.cc:193] Create schedule queue 9
[2019-12-01 06:56:13.608521: D byteps/common/global.cc:201] Inited rank=2 local_rank=2 size=4 local_size=4 worker_id=0
[2019-12-01 06:56:13.608550: D byteps/common/global.cc:233] Started 2 background threads. rank=3
[2019-12-01 06:56:13.608752: D byteps/common/global.cc:233] Started 4 background threads. rank=2
[2019-12-01 06:56:13.743501: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv1.bias, declared key (not PS key): 0 rank=1
[2019-12-01 06:56:13.743558: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv1.weight, declared key (not PS key): 1 rank=1
[2019-12-01 06:56:13.743580: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv2.bias, declared key (not PS key): 2 rank=1
[2019-12-01 06:56:13.743593: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv2.weight, declared key (not PS key): 3 rank=1
[2019-12-01 06:56:13.743611: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc1.bias, declared key (not PS key): 4 rank=1
[2019-12-01 06:56:13.743632: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc1.weight, declared key (not PS key): 5 rank=1
[2019-12-01 06:56:13.743645: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc2.bias, declared key (not PS key): 6 rank=1
[2019-12-01 06:56:13.743663: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc2.weight, declared key (not PS key): 7 rank=1
[2019-12-01 06:56:13.743685: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv1.bias, declared key (not PS key): 8 rank=1
[2019-12-01 06:56:13.743699: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv1.weight, declared key (not PS key): 9 rank=1
[2019-12-01 06:56:13.743725: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv2.bias, declared key (not PS key): 10 rank=1
[2019-12-01 06:56:13.743746: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv2.weight, declared key (not PS key): 11 rank=1
[2019-12-01 06:56:13.743769: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc1.bias, declared key (not PS key): 12 rank=1
[2019-12-01 06:56:13.743781: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc1.weight, declared key (not PS key): 13 rank=1
[2019-12-01 06:56:13.743794: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc2.bias, declared key (not PS key): 14 rank=1
[2019-12-01 06:56:13.743811: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc2.weight, declared key (not PS key): 15 rank=1
[2019-12-01 06:56:13.743911: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv1.bias, declared key (not PS key): 0 rank=0
[2019-12-01 06:56:13.743970: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv1.weight, declared key (not PS key): 1 rank=0
[2019-12-01 06:56:13.743990: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv2.bias, declared key (not PS key): 2 rank=0
[2019-12-01 06:56:13.744012: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv2.weight, declared key (not PS key): 3 rank=0
[2019-12-01 06:56:13.744028: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc1.bias, declared key (not PS key): 4 rank=0
[2019-12-01 06:56:13.744049: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc1.weight, declared key (not PS key): 5 rank=0
[2019-12-01 06:56:13.744062: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc2.bias, declared key (not PS key): 6 rank=0
[2019-12-01 06:56:13.744090: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc2.weight, declared key (not PS key): 7 rank=0
[2019-12-01 06:56:13.744121: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv1.bias, declared key (not PS key): 8 rank=0
[2019-12-01 06:56:13.744156: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv1.weight, declared key (not PS key): 9 rank=0
[2019-12-01 06:56:13.744187: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv2.bias, declared key (not PS key): 10 rank=0
[2019-12-01 06:56:13.744202: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv2.weight, declared key (not PS key): 11 rank=0
[2019-12-01 06:56:13.744216: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc1.bias, declared key (not PS key): 12 rank=0
[2019-12-01 06:56:13.744205: D byteps/common/operations.cc:229] byteps.Parameter.conv1.bias partitioned to 1 part(s), total_len=40, key_range=[524288, 524288] rank=1
[2019-12-01 06:56:13.744238: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc1.weight, declared key (not PS key): 13 rank=0
[2019-12-01 06:56:13.744239: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv1.bias, size=40, parts=1
[2019-12-01 06:56:13.744253: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc2.bias, declared key (not PS key): 14 rank=0
[2019-12-01 06:56:13.744268: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc2.weight, declared key (not PS key): 15 rank=0
[2019-12-01 06:56:13.744269: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv1.bias, declared key (not PS key): 0 rank=3
[2019-12-01 06:56:13.744311: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv1.weight, declared key (not PS key): 1 rank=3
[2019-12-01 06:56:13.744349: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv2.bias, declared key (not PS key): 2 rank=3
[2019-12-01 06:56:13.744368: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv2.weight, declared key (not PS key): 3 rank=3
[2019-12-01 06:56:13.744381: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc1.bias, declared key (not PS key): 4 rank=3
[2019-12-01 06:56:13.744401: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc1.weight, declared key (not PS key): 5 rank=3
[2019-12-01 06:56:13.744420: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc2.bias, declared key (not PS key): 6 rank=3
[2019-12-01 06:56:13.744444: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc2.weight, declared key (not PS key): 7 rank=3
[2019-12-01 06:56:13.744462: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv1.bias, declared key (not PS key): 8 rank=3
[2019-12-01 06:56:13.744478: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv1.weight, declared key (not PS key): 9 rank=3
[2019-12-01 06:56:13.744495: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv2.bias, declared key (not PS key): 10 rank=3
[2019-12-01 06:56:13.744537: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv2.weight, declared key (not PS key): 11 rank=3
[2019-12-01 06:56:13.744569: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc1.bias, declared key (not PS key): 12 rank=3
[2019-12-01 06:56:13.744585: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc1.weight, declared key (not PS key): 13 rank=3
[2019-12-01 06:56:13.744605: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc2.bias, declared key (not PS key): 14 rank=3
[2019-12-01 06:56:13.744618: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc2.weight, declared key (not PS key): 15 rank=3
[2019-12-01 06:56:13.744617: T byteps/common/shared_memory.cc:43] initialized share memory size 40
[2019-12-01 06:56:13.744639: T byteps/common/operations.cc:264] byteps.Parameter.conv1.bias: open shared memory size 40
[2019-12-01 06:56:13.744648: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv1.bias, size=40, parts=1
[2019-12-01 06:56:13.744640: D byteps/common/operations.cc:229] byteps.Parameter.conv1.bias partitioned to 1 part(s), total_len=40, key_range=[524288, 524288] rank=0
[2019-12-01 06:56:13.744665: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv1.bias, size=40, parts=1
[2019-12-01 06:56:13.744666: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv1.bias_0, key=524288, offset=0, len=40, device=1 rank=1
[2019-12-01 06:56:13.744679: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 1
[2019-12-01 06:56:13.744688: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv1.bias, rank=1
[2019-12-01 06:56:13.744760: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 1
[2019-12-01 06:56:13.744804: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv1.bias_0, key=524288; Passing to the next queue.
[2019-12-01 06:56:13.744831: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 1
[2019-12-01 06:56:13.744909: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.bias_0 send coordinate info: Signal=0, rank=1, key=524288
[2019-12-01 06:56:13.744946: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=524288, myrank=3
[2019-12-01 06:56:13.745054: D byteps/common/operations.cc:229] byteps.Parameter.conv1.bias partitioned to 1 part(s), total_len=40, key_range=[524288, 524288] rank=3
[2019-12-01 06:56:13.745085: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv1.bias, size=40, parts=1
[2019-12-01 06:56:13.745084: T byteps/common/shared_memory.cc:43] initialized share memory size 40
[2019-12-01 06:56:13.745112: T byteps/common/operations.cc:264] byteps.Parameter.conv1.bias: open shared memory size 40
[2019-12-01 06:56:13.745121: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv1.bias, size=40, parts=1
[2019-12-01 06:56:13.745144: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv1.bias_0, key=524288, offset=0, len=40, device=0 rank=0
[2019-12-01 06:56:13.745167: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 0
[2019-12-01 06:56:13.745191: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv1.bias, rank=0
[2019-12-01 06:56:13.745317: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 0
[2019-12-01 06:56:13.745351: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv1.bias_0, key=524288; Passing to the next queue.
[2019-12-01 06:56:13.745368: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 0
[2019-12-01 06:56:13.745412: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.bias_0 send coordinate info: Signal=0, rank=0, key=524288
[2019-12-01 06:56:13.745434: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=0, key=524288, myrank=3
[2019-12-01 06:56:13.745487: T byteps/common/shared_memory.cc:43] initialized share memory size 40
[2019-12-01 06:56:13.745521: T byteps/common/operations.cc:264] byteps.Parameter.conv1.bias: open shared memory size 40
[2019-12-01 06:56:13.745531: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv1.bias, size=40, parts=1
[2019-12-01 06:56:13.745561: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv1.bias_0, key=524288, offset=0, len=40, device=3 rank=3
[2019-12-01 06:56:13.745581: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 3
[2019-12-01 06:56:13.745598: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv1.bias, rank=3
[2019-12-01 06:56:13.745794: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv1.bias, declared key (not PS key): 0 rank=2
[2019-12-01 06:56:13.745835: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv1.weight, declared key (not PS key): 1 rank=2
[2019-12-01 06:56:13.745861: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv2.bias, declared key (not PS key): 2 rank=2
[2019-12-01 06:56:13.745884: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.conv2.weight, declared key (not PS key): 3 rank=2
[2019-12-01 06:56:13.745907: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc1.bias, declared key (not PS key): 4 rank=2
[2019-12-01 06:56:13.745930: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc1.weight, declared key (not PS key): 5 rank=2
[2019-12-01 06:56:13.745960: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc2.bias, declared key (not PS key): 6 rank=2
[2019-12-01 06:56:13.745990: D byteps/common/global.cc:320] Declared tensor byteps.Gradient.fc2.weight, declared key (not PS key): 7 rank=2
[2019-12-01 06:56:13.746024: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv1.bias, declared key (not PS key): 8 rank=2
[2019-12-01 06:56:13.746049: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv1.weight, declared key (not PS key): 9 rank=2
[2019-12-01 06:56:13.746073: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv2.bias, declared key (not PS key): 10 rank=2
[2019-12-01 06:56:13.746094: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.conv2.weight, declared key (not PS key): 11 rank=2
[2019-12-01 06:56:13.746123: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc1.bias, declared key (not PS key): 12 rank=2
[2019-12-01 06:56:13.746146: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc1.weight, declared key (not PS key): 13 rank=2
[2019-12-01 06:56:13.746178: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc2.bias, declared key (not PS key): 14 rank=2
[2019-12-01 06:56:13.746215: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.fc2.weight, declared key (not PS key): 15 rank=2
[2019-12-01 06:56:13.746604: D byteps/common/operations.cc:229] byteps.Parameter.conv1.bias partitioned to 1 part(s), total_len=40, key_range=[524288, 524288] rank=2
[2019-12-01 06:56:13.746641: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv1.bias, size=40, parts=1
[2019-12-01 06:56:13.747030: T byteps/common/shared_memory.cc:43] initialized share memory size 40
[2019-12-01 06:56:13.747062: T byteps/common/operations.cc:264] byteps.Parameter.conv1.bias: open shared memory size 40
[2019-12-01 06:56:13.747079: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv1.bias, size=40, parts=1
[2019-12-01 06:56:13.747096: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv1.bias_0, key=524288, offset=0, len=40, device=2 rank=2
[2019-12-01 06:56:13.747112: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 2
[2019-12-01 06:56:13.747121: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv1.bias, rank=2
[2019-12-01 06:56:13.747253: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 2
[2019-12-01 06:56:13.747282: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv1.bias_0, key=524288; Passing to the next queue.
[2019-12-01 06:56:13.747298: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 2
[2019-12-01 06:56:13.747349: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.bias_0 send coordinate info: Signal=0, rank=2, key=524288
[2019-12-01 06:56:13.747363: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=524288, myrank=3
[2019-12-01 06:56:13.747542: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 3
[2019-12-01 06:56:13.747629: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.bias_0 calling NCCL REDUCE (rank=3) key=524288, elements=10, device=3
[2019-12-01 06:56:13.747640: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=524288, myrank=1
[2019-12-01 06:56:13.747643: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=524288, myrank=2
[2019-12-01 06:56:13.747663: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv1.bias_0 key: 524288 rank: 1
[2019-12-01 06:56:13.747654: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=524288, myrank=0
[2019-12-01 06:56:13.747693: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv1.bias_0 key: 524288 rank: 2
[2019-12-01 06:56:13.747696: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv1.bias_0 key: 524288 rank: 0
[2019-12-01 06:56:13.747703: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.bias_0 calling NCCL REDUCE (rank=1) key=524288, elements=10, device=1
[2019-12-01 06:56:13.747724: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.bias_0 calling NCCL REDUCE (rank=2) key=524288, elements=10, device=2
[2019-12-01 06:56:13.747726: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.bias_0 calling NCCL REDUCE (rank=0) key=524288, elements=10, device=0
[2019-12-01 06:56:13.747728: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.747738: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.747754: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.747759: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
ip-172-31-73-142:395:554 [0] NCCL INFO Launch mode Parallel
[2019-12-01 06:56:13.747968: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.Parameter.conv1.bias_0, key=524288; Passing to the next queue.
[2019-12-01 06:56:13.747977: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.Parameter.conv1.bias_0, key=524288; Passing to the next queue.
[2019-12-01 06:56:13.747982: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.Parameter.conv1.bias_0, key=524288; Passing to the next queue.
[2019-12-01 06:56:13.747997: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 0
[2019-12-01 06:56:13.748006: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 2
[2019-12-01 06:56:13.747998: T byteps/common/core_loops.cc:67] Rank=3 finishes REDUCE, tensor: byteps.Parameter.conv1.bias_0, key=524288; Passing to the next queue.
[2019-12-01 06:56:13.748011: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 1
[2019-12-01 06:56:13.748022: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 3
[2019-12-01 06:56:13.748025: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.748028: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.748025: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.748045: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.748057: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 0
[2019-12-01 06:56:13.748083: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv1.bias_0, key=524288; Passing to the next queue.
[2019-12-01 06:56:13.748103: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 0
[2019-12-01 06:56:13.748112: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 1
[2019-12-01 06:56:13.748142: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv1.bias_0, key=524288; Passing to the next queue.
[2019-12-01 06:56:13.748141: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.bias_0 send coordinate info: Signal=2, rank=0, key=524288
[2019-12-01 06:56:13.748160: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 1
[2019-12-01 06:56:13.748158: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=524288, myrank=3
[2019-12-01 06:56:13.748183: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 2
[2019-12-01 06:56:13.748215: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.bias_0 send coordinate info: Signal=2, rank=1, key=524288
[2019-12-01 06:56:13.748218: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv1.bias_0, key=524288; Passing to the next queue.
[2019-12-01 06:56:13.748228: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=524288, myrank=3
[2019-12-01 06:56:13.748237: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 2
[2019-12-01 06:56:13.748280: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.bias_0 send coordinate info: Signal=2, rank=2, key=524288
[2019-12-01 06:56:13.748288: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=524288, myrank=3
[2019-12-01 06:56:13.748439: T byteps/common/scheduled_queue.cc:129] Queue BROADCAST getTask: byteps.Parameter.conv1.bias_0 key: 524288 rank: 3
[2019-12-01 06:56:13.748490: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=524288, myrank=0
[2019-12-01 06:56:13.748499: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.bias_0 calling NCCL BROADCAST (rank=3) key=524288, elements=10, device=3
[2019-12-01 06:56:13.748498: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=524288, myrank=1
[2019-12-01 06:56:13.748506: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=524288, myrank=2
[2019-12-01 06:56:13.748519: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv1.bias_0 key: 524288 rank: 0
[2019-12-01 06:56:13.748521: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv1.bias_0 key: 524288 rank: 1
[2019-12-01 06:56:13.748524: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv1.bias_0 key: 524288 rank: 2
[2019-12-01 06:56:13.748541: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.bias_0 calling NCCL BROADCAST (rank=0) key=524288, elements=10, device=0
[2019-12-01 06:56:13.748541: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.bias_0 calling NCCL BROADCAST (rank=1) key=524288, elements=10, device=1
[2019-12-01 06:56:13.748547: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.bias_0 calling NCCL BROADCAST (rank=2) key=524288, elements=10, device=2
[2019-12-01 06:56:13.748556: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.748565: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.748571: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.748571: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.748743: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Parameter.conv1.bias_0
[2019-12-01 06:56:13.748748: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Parameter.conv1.bias_0
[2019-12-01 06:56:13.748752: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Parameter.conv1.bias_0
[2019-12-01 06:56:13.748773: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.748774: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.748776: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.748816: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Parameter.conv1.bias_0
[2019-12-01 06:56:13.748851: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.749169: D byteps/common/operations.cc:229] byteps.Parameter.conv1.weight partitioned to 1 part(s), total_len=1000, key_range=[589824, 589824] rank=0
[2019-12-01 06:56:13.749197: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv1.weight, size=1000, parts=1
[2019-12-01 06:56:13.749499: D byteps/common/operations.cc:229] byteps.Parameter.conv1.weight partitioned to 1 part(s), total_len=1000, key_range=[589824, 589824] rank=3
[2019-12-01 06:56:13.749527: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv1.weight, size=1000, parts=1
[2019-12-01 06:56:13.749576: T byteps/common/shared_memory.cc:43] initialized share memory size 1000
[2019-12-01 06:56:13.749603: T byteps/common/operations.cc:264] byteps.Parameter.conv1.weight: open shared memory size 1000
[2019-12-01 06:56:13.749623: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv1.weight, size=1000, parts=1
[2019-12-01 06:56:13.749647: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv1.weight_0, key=589824, offset=0, len=1000, device=0 rank=0
[2019-12-01 06:56:13.749674: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 0
[2019-12-01 06:56:13.749686: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv1.weight, rank=0
[2019-12-01 06:56:13.749826: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 0
[2019-12-01 06:56:13.749852: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv1.weight_0, key=589824; Passing to the next queue.
[2019-12-01 06:56:13.749862: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 0
[2019-12-01 06:56:13.749890: D byteps/common/operations.cc:229] byteps.Parameter.conv1.weight partitioned to 1 part(s), total_len=1000, key_range=[589824, 589824] rank=1
[2019-12-01 06:56:13.749920: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.weight_0 send coordinate info: Signal=0, rank=0, key=589824
[2019-12-01 06:56:13.749915: T byteps/common/shared_memory.cc:43] initialized share memory size 1000
[2019-12-01 06:56:13.749922: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv1.weight, size=1000, parts=1
[[2019-12-01 06:56:13.749974: T byteps/common/operations.cc:264] byteps.Parameter.conv1.weight: open shared memory size 1000
[2019-12-01 06:56:13.749995: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv1.weight, size=1000, parts=1
2019-12-01 06:56:13.749941: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=0, key=589824, myrank=3
[2019-12-01 06:56:13.750013: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv1.weight_0, key=589824, offset=0, len=1000, device=3 rank=3
[2019-12-01 06:56:13.750025: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 3
[2019-12-01 06:56:13.750034: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv1.weight, rank=3
[2019-12-01 06:56:13.750059: D byteps/common/operations.cc:229] byteps.Parameter.conv1.weight partitioned to 1 part(s), total_len=1000, key_range=[589824, 589824] rank=2
[2019-12-01 06:56:13.750086: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv1.weight, size=1000, parts=1
[2019-12-01 06:56:13.750476: T byteps/common/shared_memory.cc:43] initialized share memory size 1000
[2019-12-01 06:56:13.750509: T byteps/common/operations.cc:264] byteps.Parameter.conv1.weight: open shared memory size 1000
[2019-12-01 06:56:13.750530: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv1.weight, size=1000, parts=1
[2019-12-01 06:56:13.750559: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv1.weight_0, key=589824, offset=0, len=1000, device=1 rank=1
[2019-12-01 06:56:13.750582: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 1
[2019-12-01 06:56:13.750603: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv1.weight, rank=1
[2019-12-01 06:56:13.750686: T byteps/common/shared_memory.cc:43] initialized share memory size 1000
[2019-12-01 06:56:13.750714: T byteps/common/operations.cc:264] byteps.Parameter.conv1.weight: open shared memory size 1000
[2019-12-01 06:56:13.750730: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv1.weight, size=1000, parts=1
[2019-12-01 06:56:13.750760: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv1.weight_0, key=589824, offset=0, len=1000, device=2 rank=2
[2019-12-01 06:56:13.750752: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 1
[2019-12-01 06:56:13.750778: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 2
[2019-12-01 06:56:13.750783: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv1.weight_0, key=589824; Passing to the next queue.
[2019-12-01 06:56:13.750796: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv1.weight, rank=2
[2019-12-01 06:56:13.750803: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 1
[2019-12-01 06:56:13.750840: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.weight_0 send coordinate info: Signal=0, rank=1, key=589824
[2019-12-01 06:56:13.750821: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 2
[2019-12-01 06:56:13.750850: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=589824, myrank=3
[2019-12-01 06:56:13.750866: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv1.weight_0, key=589824; Passing to the next queue.
[2019-12-01 06:56:13.750883: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 2
[2019-12-01 06:56:13.750940: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.weight_0 send coordinate info: Signal=0, rank=2, key=589824
[2019-12-01 06:56:13.750950: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=589824, myrank=3
[2019-12-01 06:56:13.751032: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 3
[2019-12-01 06:56:13.751097: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=589824, myrank=0
[2019-12-01 06:56:13.751110: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.weight_0 calling NCCL REDUCE (rank=3) key=589824, elements=250, device=3
[2019-12-01 06:56:13.751126: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv1.weight_0 key: 589824 rank: 0
[2019-12-01 06:56:13.751120: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=589824, myrank=2
[2019-12-01 06:56:13.751108: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=589824, myrank=1
[2019-12-01 06:56:13.751148: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv1.weight_0 key: 589824 rank: 2
[2019-12-01 06:56:13.751152: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.weight_0 calling NCCL REDUCE (rank=0) key=589824, elements=250, device=0
[2019-12-01 06:56:13.751162: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv1.weight_0 key: 589824 rank: 1
[2019-12-01 06:56:13.751164: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.751176: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.weight_0 calling NCCL REDUCE (rank=2) key=589824, elements=250, device=2
[2019-12-01 06:56:13.751177: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.751186: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.weight_0 calling NCCL REDUCE (rank=1) key=589824, elements=250, device=1
[2019-12-01 06:56:13.751201: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.751209: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.751301: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.Parameter.conv1.weight_0, key=589824; Passing to the next queue.
[2019-12-01 06:56:13.751344: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 0
[2019-12-01 06:56:13.751372: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.751391: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.Parameter.conv1.weight_0, key=589824; Passing to the next queue.
[2019-12-01 06:56:13.751391: T byteps/common/core_loops.cc:67] Rank=3 finishes REDUCE, tensor: byteps.Parameter.conv1.weight_0, key=589824; Passing to the next queue.
[2019-12-01 06:56:13.751422: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 2
[2019-12-01 06:56:13.751426: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 3
[2019-12-01 06:56:13.751435: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.751452: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.751431: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.Parameter.conv1.weight_0, key=589824; Passing to the next queue.
[2019-12-01 06:56:13.751480: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 1
[2019-12-01 06:56:13.751482: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 0
[2019-12-01 06:56:13.751494: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.751505: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv1.weight_0, key=589824; Passing to the next queue.
[2019-12-01 06:56:13.751523: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 0
[2019-12-01 06:56:13.751557: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.weight_0 send coordinate info: Signal=2, rank=0, key=589824
[2019-12-01 06:56:13.751572: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=589824, myrank=3
[2019-12-01 06:56:13.751568: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 1
[2019-12-01 06:56:13.751593: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 2
[2019-12-01 06:56:13.751602: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv1.weight_0, key=589824; Passing to the next queue.
[2019-12-01 06:56:13.751627: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv1.weight_0, key=589824; Passing to the next queue.
[2019-12-01 06:56:13.751629: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 1
[2019-12-01 06:56:13.751648: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 2
[2019-12-01 06:56:13.751686: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.weight_0 send coordinate info: Signal=2, rank=2, key=589824
[2019-12-01 06:56:13.751688: T byteps/common/core_loops.cc:126] byteps.Parameter.conv1.weight_0 send coordinate info: Signal=2, rank=1, key=589824
[2019-12-01 06:56:13.751709: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=589824, myrank=3
[2019-12-01 06:56:13.751736: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=589824, myrank=3
[2019-12-01 06:56:13.751747: T byteps/common/scheduled_queue.cc:129] Queue BROADCAST getTask: byteps.Parameter.conv1.weight_0 key: 589824 rank: 3
[2019-12-01 06:56:13.751834: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=589824, myrank=0
[2019-12-01 06:56:13.751850: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=589824, myrank=1
[2019-12-01 06:56:13.751854: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.weight_0 calling NCCL BROADCAST (rank=3) key=589824, elements=250, device=3
[2019-12-01 06:56:13.751858: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv1.weight_0 key: 589824 rank: 0
[2019-12-01 06:56:13.751865: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=589824, myrank=2
[2019-12-01 06:56:13.751884: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv1.weight_0 key: 589824 rank: 1
[2019-12-01 06:56:13.751890: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.weight_0 calling NCCL BROADCAST (rank=0) key=589824, elements=250, device=0
[2019-12-01 06:56:13.751895: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv1.weight_0 key: 589824 rank: 2
[2019-12-01 06:56:13.751900: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.weight_0 calling NCCL BROADCAST (rank=1) key=589824, elements=250, device=1
[2019-12-01 06:56:13.751911: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.751916: T byteps/common/core_loops.cc:165] byteps.Parameter.conv1.weight_0 calling NCCL BROADCAST (rank=2) key=589824, elements=250, device=2
[2019-12-01 06:56:13.751920: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.751935: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.751928: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.752001: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Parameter.conv1.weight_0
[2019-12-01 06:56:13.752004: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Parameter.conv1.weight_0
[2019-12-01 06:56:13.752002: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Parameter.conv1.weight_0
[2019-12-01 06:56:13.752026: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.752028: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.752031: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.752113: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Parameter.conv1.weight_0
[2019-12-01 06:56:13.752145: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.752237: D byteps/common/operations.cc:229] byteps.Parameter.conv2.bias partitioned to 1 part(s), total_len=80, key_range=[655360, 655360] rank=1
[2019-12-01 06:56:13.752277: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv2.bias, size=80, parts=1
[2019-12-01 06:56:13.752446: D byteps/common/operations.cc:229] byteps.Parameter.conv2.bias partitioned to 1 part(s), total_len=80, key_range=[655360, 655360] rank=2
[2019-12-01 06:56:13.752475: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv2.bias, size=80, parts=1
[2019-12-01 06:56:13.752601: D byteps/common/operations.cc:229] byteps.Parameter.conv2.bias partitioned to 1 part(s), total_len=80, key_range=[655360, 655360] rank=0
[2019-12-01 06:56:13.752630: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv2.bias, size=80, parts=1
[2019-12-01 06:56:13.752870: T byteps/common/shared_memory.cc:43] initialized share memory size 80
[2019-12-01 06:56:13.752900: T byteps/common/operations.cc:264] byteps.Parameter.conv2.bias: open shared memory size 80
[2019-12-01 06:56:13.752923: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv2.bias, size=80, parts=1
[2019-12-01 06:56:13.752950: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv2.bias_0, key=655360, offset=0, len=80, device=1 rank=1
[2019-12-01 06:56:13.752969: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 1
[2019-12-01 06:56:13.752984: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv2.bias, rank=1
[2019-12-01 06:56:13.752966: D byteps/common/operations.cc:229] byteps.Parameter.conv2.bias partitioned to 1 part(s), total_len=80, key_range=[655360, 655360] rank=3
[2019-12-01 06:56:13.753005: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv2.bias, size=80, parts=1
[2019-12-01 06:56:13.753103: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 1
[2019-12-01 06:56:13.753135: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv2.bias_0, key=655360; Passing to the next queue.
[2019-12-01 06:56:13.753153: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 1
[2019-12-01 06:56:13.753180: T byteps/common/shared_memory.cc:43] initialized share memory size 80
[2019-12-01 06:56:13.753193: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.bias_0 send coordinate info: Signal=0, rank=1, key=655360
[2019-12-01 06:56:13.753213: T byteps/common/operations.cc:264] byteps.Parameter.conv2.bias: open shared memory size 80
[2019-12-01 06:56:13.753216: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=655360, myrank=3
[2019-12-01 06:56:13.753244: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv2.bias, size=80, parts=1
[2019-12-01 06:56:13.753278: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv2.bias_0, key=655360, offset=0, len=80, device=2 rank=2
[2019-12-01 06:56:13.753298: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 2
[2019-12-01 06:56:13.753320: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv2.bias, rank=2
[2019-12-01 06:56:13.753403: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 2
[2019-12-01 06:56:13.753438: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv2.bias_0, key=655360; Passing to the next queue.
[2019-12-01 06:56:13.753460: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 2
[2019-12-01 06:56:13.753497: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.bias_0 send coordinate info: Signal=0, rank=2, key=655360
[2019-12-01 06:56:13.753508: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=655360, myrank=3
[2019-12-01 06:56:13.753510: T byteps/common/shared_memory.cc:43] initialized share memory size 80
[2019-12-01 06:56:13.753544: T byteps/common/operations.cc:264] byteps.Parameter.conv2.bias: open shared memory size 80
[2019-12-01 06:56:13.753573: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv2.bias, size=80, parts=1
[2019-12-01 06:56:13.753591: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv2.bias_0, key=655360, offset=0, len=80, device=0 rank=0
[2019-12-01 06:56:13.753609: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 0
[2019-12-01 06:56:13.753626: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv2.bias, rank=0
[2019-12-01 06:56:13.753715: T byteps/common/shared_memory.cc:43] initialized share memory size 80
[2019-12-01 06:56:13.753740: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 0
[2019-12-01 06:56:13.753749: T byteps/common/operations.cc:264] byteps.Parameter.conv2.bias: open shared memory size 80
[2019-12-01 06:56:13.753766: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv2.bias_0, key=655360; Passing to the next queue.
[2019-12-01 06:56:13.753784: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 0
[2019-12-01 06:56:13.753803: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv2.bias, size=80, parts=1
[2019-12-01 06:56:13.753837: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.bias_0 send coordinate info: Signal=0, rank=0, key=655360
[2019-12-01 06:56:13.753847: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv2.bias_0, key=655360, offset=0, len=80, device=3 rank=3
[2019-12-01 06:56:13.753850: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=0, key=655360, myrank=3
[2019-12-01 06:56:13.753876: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 3
[2019-12-01 06:56:13.753920: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv2.bias, rank=3
[2019-12-01 06:56:13.754048: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 3
[2019-12-01 06:56:13.754110: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=655360, myrank=0
[2019-12-01 06:56:13.754122: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.bias_0 calling NCCL REDUCE (rank=3) key=655360, elements=20, device=3
[2019-12-01 06:56:13.754122: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=655360, myrank=1
[2019-12-01 06:56:13.754140: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv2.bias_0 key: 655360 rank: 0
[2019-12-01 06:56:13.754131: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=655360, myrank=2
[2019-12-01 06:56:13.754158: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv2.bias_0 key: 655360 rank: 1
[2019-12-01 06:56:13.754172: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv2.bias_0 key: 655360 rank: 2
[2019-12-01 06:56:13.754178: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.bias_0 calling NCCL REDUCE (rank=0) key=655360, elements=20, device=0
[2019-12-01 06:56:13.754181: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.754196: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.754196: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.bias_0 calling NCCL REDUCE (rank=2) key=655360, elements=20, device=2
[2019-12-01 06:56:13.754192: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.bias_0 calling NCCL REDUCE (rank=1) key=655360, elements=20, device=1
[2019-12-01 06:56:13.754214: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.754219: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.754299: T byteps/common/core_loops.cc:67] Rank=3 finishes REDUCE, tensor: byteps.Parameter.conv2.bias_0, key=655360; Passing to the next queue.
[2019-12-01 06:56:13.754304: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.Parameter.conv2.bias_0, key=655360; Passing to the next queue.
[2019-12-01 06:56:13.754326: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 3
[2019-12-01 06:56:13.754327: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 0
[2019-12-01 06:56:13.754344: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.754348: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.754388: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.Parameter.conv2.bias_0, key=655360; Passing to the next queue.
[2019-12-01 06:56:13.754426: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 1
[2019-12-01 06:56:13.754447: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.754455: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.Parameter.conv2.bias_0, key=655360; Passing to the next queue.
[2019-12-01 06:56:13.754453: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 0
[2019-12-01 06:56:13.754484: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 2
[2019-12-01 06:56:13.754488: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv2.bias_0, key=655360; Passing to the next queue.
[2019-12-01 06:56:13.754501: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 0
[2019-12-01 06:56:13.754503: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.754553: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.bias_0 send coordinate info: Signal=2, rank=0, key=655360
[2019-12-01 06:56:13.754568: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=655360, myrank=3
[2019-12-01 06:56:13.754564: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 2
[2019-12-01 06:56:13.754601: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv2.bias_0, key=655360; Passing to the next queue.
[2019-12-01 06:56:13.754620: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 2
[2019-12-01 06:56:13.754613: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 1
[2019-12-01 06:56:13.754647: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv2.bias_0, key=655360; Passing to the next queue.
[2019-12-01 06:56:13.754653: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.bias_0 send coordinate info: Signal=2, rank=2, key=655360
[2019-12-01 06:56:13.754661: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=655360, myrank=3
[2019-12-01 06:56:13.754663: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 1
[2019-12-01 06:56:13.754704: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.bias_0 send coordinate info: Signal=2, rank=1, key=655360
[2019-12-01 06:56:13.754711: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=655360, myrank=3
[2019-12-01 06:56:13.754760: T byteps/common/scheduled_queue.cc:129] Queue BROADCAST getTask: byteps.Parameter.conv2.bias_0 key: 655360 rank: 3
[2019-12-01 06:56:13.754812: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=655360, myrank=0
[2019-12-01 06:56:13.754830: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv2.bias_0 key: 655360 rank: 0
[2019-12-01 06:56:13.754824: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.bias_0 calling NCCL BROADCAST (rank=3) key=655360, elements=20, device=3
[2019-12-01 06:56:13.754831: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=655360, myrank=2
[2019-12-01 06:56:13.754832: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=655360, myrank=1
[2019-12-01 06:56:13.754849: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv2.bias_0 key: 655360 rank: 2
[2019-12-01 06:56:13.754864: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.bias_0 calling NCCL BROADCAST (rank=0) key=655360, elements=20, device=0
[2019-12-01 06:56:13.754863: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv2.bias_0 key: 655360 rank: 1
[2019-12-01 06:56:13.754871: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.bias_0 calling NCCL BROADCAST (rank=2) key=655360, elements=20, device=2
[2019-12-01 06:56:13.754891: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.754899: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.754917: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.754917: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.bias_0 calling NCCL BROADCAST (rank=1) key=655360, elements=20, device=1
[2019-12-01 06:56:13.754953: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.755050: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Parameter.conv2.bias_0
[2019-12-01 06:56:13.755081: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.755147: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Parameter.conv2.bias_0
[2019-12-01 06:56:13.755164: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Parameter.conv2.bias_0
[2019-12-01 06:56:13.755177: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.755188: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Parameter.conv2.bias_0
[2019-12-01 06:56:13.755202: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.755225: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.755333: D byteps/common/operations.cc:229] byteps.Parameter.conv2.weight partitioned to 1 part(s), total_len=20000, key_range=[720896, 720896] rank=3
[2019-12-01 06:56:13.755360: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv2.weight, size=20000, parts=1
[2019-12-01 06:56:13.755709: T byteps/common/shared_memory.cc:43] initialized share memory size 20000
[2019-12-01 06:56:13.755710: D byteps/common/operations.cc:229] byteps.Parameter.conv2.weight partitioned to 1 part(s), total_len=20000, key_range=[720896, 720896] rank=1
[2019-12-01 06:56:13.755747: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv2.weight, size=20000, parts=1
[2019-12-01 06:56:13.755748: T byteps/common/operations.cc:264] byteps.Parameter.conv2.weight: open shared memory size 20000
[2019-12-01 06:56:13.755766: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv2.weight, size=20000, parts=1
[2019-12-01 06:56:13.755793: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv2.weight_0, key=720896, offset=0, len=20000, device=3 rank=3
[2019-12-01 06:56:13.755816: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 3
[2019-12-01 06:56:13.755838: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv2.weight, rank=3
[2019-12-01 06:56:13.755840: D byteps/common/operations.cc:229] byteps.Parameter.conv2.weight partitioned to 1 part(s), total_len=20000, key_range=[720896, 720896] rank=2
[2019-12-01 06:56:13.755870: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv2.weight, size=20000, parts=1
[2019-12-01 06:56:13.756067: D byteps/common/operations.cc:229] byteps.Parameter.conv2.weight partitioned to 1 part(s), total_len=20000, key_range=[720896, 720896] rank=0
[2019-12-01 06:56:13.756100: T byteps/common/operations.cc:243] Begin init byteps.Parameter.conv2.weight, size=20000, parts=1
[2019-12-01 06:56:13.756229: T byteps/common/shared_memory.cc:43] initialized share memory size 20000
[2019-12-01 06:56:13.756251: T byteps/common/operations.cc:264] byteps.Parameter.conv2.weight: open shared memory size 20000
[2019-12-01 06:56:13.756260: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv2.weight, size=20000, parts=1
[2019-12-01 06:56:13.756287: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv2.weight_0, key=720896, offset=0, len=20000, device=1 rank=1
[2019-12-01 06:56:13.756310: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 1
[2019-12-01 06:56:13.756333: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv2.weight, rank=1
[2019-12-01 06:56:13.756356: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 1
[2019-12-01 06:56:13.756391: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv2.weight_0, key=720896; Passing to the next queue.
[2019-12-01 06:56:13.756401: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 1
[2019-12-01 06:56:13.756456: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.weight_0 send coordinate info: Signal=0, rank=1, key=720896
[2019-12-01 06:56:13.756475: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=720896, myrank=3
[2019-12-01 06:56:13.756581: T byteps/common/shared_memory.cc:43] initialized share memory size 20000
[2019-12-01 06:56:13.756608: T byteps/common/operations.cc:264] byteps.Parameter.conv2.weight: open shared memory size 20000
[2019-12-01 06:56:13.756627: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv2.weight, size=20000, parts=1
[2019-12-01 06:56:13.756661: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv2.weight_0, key=720896, offset=0, len=20000, device=2 rank=2
[2019-12-01 06:56:13.756692: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 2
[2019-12-01 06:56:13.756714: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv2.weight, rank=2
[2019-12-01 06:56:13.756793: T byteps/common/shared_memory.cc:43] initialized share memory size 20000
[2019-12-01 06:56:13.756821: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 2
[2019-12-01 06:56:13.756829: T byteps/common/operations.cc:264] byteps.Parameter.conv2.weight: open shared memory size 20000
[2019-12-01 06:56:13.756844: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv2.weight_0, key=720896; Passing to the next queue.
[2019-12-01 06:56:13.756854: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.conv2.weight, size=20000, parts=1
[2019-12-01 06:56:13.756864: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 2
[2019-12-01 06:56:13.756873: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.conv2.weight_0, key=720896, offset=0, len=20000, device=0 rank=0
[2019-12-01 06:56:13.756894: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 0
[2019-12-01 06:56:13.756897: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.weight_0 send coordinate info: Signal=0, rank=2, key=720896
[2019-12-01 06:56:13.756909: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.conv2.weight, rank=0
[2019-12-01 06:56:13.756912: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=720896, myrank=3
[2019-12-01 06:56:13.756975: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 0
[2019-12-01 06:56:13.757009: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.conv2.weight_0, key=720896; Passing to the next queue.
[2019-12-01 06:56:13.757037: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 0
[2019-12-01 06:56:13.757073: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.weight_0 send coordinate info: Signal=0, rank=0, key=720896
[2019-12-01 06:56:13.757084: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=0, key=720896, myrank=3
[2019-12-01 06:56:13.757192: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 3
[2019-12-01 06:56:13.757251: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=720896, myrank=0
[2019-12-01 06:56:13.757262: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.weight_0 calling NCCL REDUCE (rank=3) key=720896, elements=5000, device=3
[2019-12-01 06:56:13.757268: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=720896, myrank=2
[2019-12-01 06:56:13.757276: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv2.weight_0 key: 720896 rank: 0
[2019-12-01 06:56:13.757265: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=720896, myrank=1
[2019-12-01 06:56:13.757294: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv2.weight_0 key: 720896 rank: 2
[2019-12-01 06:56:13.757300: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.weight_0 calling NCCL REDUCE (rank=0) key=720896, elements=5000, device=0
[2019-12-01 06:56:13.757299: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.conv2.weight_0 key: 720896 rank: 1
[2019-12-01 06:56:13.757314: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.757319: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.weight_0 calling NCCL REDUCE (rank=2) key=720896, elements=5000, device=2
[2019-12-01 06:56:13.757323: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.757327: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.weight_0 calling NCCL REDUCE (rank=1) key=720896, elements=5000, device=1
[2019-12-01 06:56:13.757341: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.757362: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.757469: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.Parameter.conv2.weight_0, key=720896; Passing to the next queue.
[2019-12-01 06:56:13.757486: T byteps/common/core_loops.cc:67] Rank=3 finishes REDUCE, tensor: byteps.Parameter.conv2.weight_0, key=720896; Passing to the next queue.
[2019-12-01 06:56:13.757513: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 3
[2019-12-01 06:56:13.757514: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 2
[2019-12-01 06:56:13.757534: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.757538: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.757529: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.Parameter.conv2.weight_0, key=720896; Passing to the next queue.
[2019-12-01 06:56:13.757563: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 1
[2019-12-01 06:56:13.757558: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 2
[2019-12-01 06:56:13.757585: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv2.weight_0, key=720896; Passing to the next queue.
[2019-12-01 06:56:13.757585: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.757587: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.Parameter.conv2.weight_0, key=720896; Passing to the next queue.
[2019-12-01 06:56:13.757604: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 2
[2019-12-01 06:56:13.757626: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 0
[2019-12-01 06:56:13.757641: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.757650: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.weight_0 send coordinate info: Signal=2, rank=2, key=720896
[2019-12-01 06:56:13.757655: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 1
[2019-12-01 06:56:13.757664: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=720896, myrank=3
[2019-12-01 06:56:13.757684: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv2.weight_0, key=720896; Passing to the next queue.
[2019-12-01 06:56:13.757682: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 0
[2019-12-01 06:56:13.757704: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.conv2.weight_0, key=720896; Passing to the next queue.
[2019-12-01 06:56:13.757704: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 1
[2019-12-01 06:56:13.757723: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 0
[2019-12-01 06:56:13.757761: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.weight_0 send coordinate info: Signal=2, rank=0, key=720896
[2019-12-01 06:56:13.757772: T byteps/common/core_loops.cc:126] byteps.Parameter.conv2.weight_0 send coordinate info: Signal=2, rank=1, key=720896
[2019-12-01 06:56:13.757812: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=720896, myrank=3
[2019-12-01 06:56:13.757835: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=720896, myrank=3
[2019-12-01 06:56:13.757889: T byteps/common/scheduled_queue.cc:129] Queue BROADCAST getTask: byteps.Parameter.conv2.weight_0 key: 720896 rank: 3
[2019-12-01 06:56:13.757940: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=720896, myrank=0
[2019-12-01 06:56:13.757951: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.weight_0 calling NCCL BROADCAST (rank=3) key=720896, elements=5000, device=3
[2019-12-01 06:56:13.757958: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=720896, myrank=2
[2019-12-01 06:56:13.757956: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=720896, myrank=1
[2019-12-01 06:56:13.757968: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv2.weight_0 key: 720896 rank: 0
[2019-12-01 06:56:13.757975: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv2.weight_0 key: 720896 rank: 2
[2019-12-01 06:56:13.757979: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.conv2.weight_0 key: 720896 rank: 1
[2019-12-01 06:56:13.757984: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.weight_0 calling NCCL BROADCAST (rank=0) key=720896, elements=5000, device=0
[2019-12-01 06:56:13.757990: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.weight_0 calling NCCL BROADCAST (rank=2) key=720896, elements=5000, device=2
[2019-12-01 06:56:13.757994: T byteps/common/core_loops.cc:165] byteps.Parameter.conv2.weight_0 calling NCCL BROADCAST (rank=1) key=720896, elements=5000, device=1
[2019-12-01 06:56:13.757997: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.758004: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.758010: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.758051: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.758171: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Parameter.conv2.weight_0
[2019-12-01 06:56:13.758204: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.758222: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Parameter.conv2.weight_0
[2019-12-01 06:56:13.758239: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Parameter.conv2.weight_0
[2019-12-01 06:56:13.758253: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.758278: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Parameter.conv2.weight_0
[2019-12-01 06:56:13.758281: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.758304: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.758450: D byteps/common/operations.cc:229] byteps.Parameter.fc1.bias partitioned to 1 part(s), total_len=200, key_range=[786432, 786432] rank=0
[2019-12-01 06:56:13.758479: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc1.bias, size=200, parts=1
[2019-12-01 06:56:13.758835: D byteps/common/operations.cc:229] byteps.Parameter.fc1.bias partitioned to 1 part(s), total_len=200, key_range=[786432, 786432] rank=3
[2019-12-01 06:56:13.758854: T byteps/common/shared_memory.cc:43] initialized share memory size 200
[2019-12-01 06:56:13.758881: T byteps/common/operations.cc:264] byteps.Parameter.fc1.bias: open shared memory size 200
[2019-12-01 06:56:13.758861: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc1.bias, size=200, parts=1
[2019-12-01 06:56:13.758907: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc1.bias, size=200, parts=1
[2019-12-01 06:56:13.758925: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc1.bias_0, key=786432, offset=0, len=200, device=0 rank=0
[2019-12-01 06:56:13.758948: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 0
[2019-12-01 06:56:13.758969: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc1.bias, rank=0
[2019-12-01 06:56:13.759015: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 0
[2019-12-01 06:56:13.759033: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc1.bias_0, key=786432; Passing to the next queue.
[2019-12-01 06:56:13.759042: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 0
[2019-12-01 06:56:13.759090: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.bias_0 send coordinate info: Signal=0, rank=0, key=786432
[2019-12-01 06:56:13.759106: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=0, key=786432, myrank=3
[2019-12-01 06:56:13.759204: D byteps/common/operations.cc:229] byteps.Parameter.fc1.bias partitioned to 1 part(s), total_len=200, key_range=[786432, 786432] rank=1
[2019-12-01 06:56:13.759236: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc1.bias, size=200, parts=1
[2019-12-01 06:56:13.759233: T byteps/common/shared_memory.cc:43] initialized share memory size 200
[2019-12-01 06:56:13.759262: T byteps/common/operations.cc:264] byteps.Parameter.fc1.bias: open shared memory size 200
[2019-12-01 06:56:13.759271: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc1.bias, size=200, parts=1
[2019-12-01 06:56:13.759288: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc1.bias_0, key=786432, offset=0, len=200, device=3 rank=3
[2019-12-01 06:56:13.759307: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 3
[2019-12-01 06:56:13.759325: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc1.bias, rank=3
[2019-12-01 06:56:13.759448: D byteps/common/operations.cc:229] byteps.Parameter.fc1.bias partitioned to 1 part(s), total_len=200, key_range=[786432, 786432] rank=2
[2019-12-01 06:56:13.759480: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc1.bias, size=200, parts=1
[2019-12-01 06:56:13.759614: T byteps/common/shared_memory.cc:43] initialized share memory size 200
[2019-12-01 06:56:13.759634: T byteps/common/operations.cc:264] byteps.Parameter.fc1.bias: open shared memory size 200
[2019-12-01 06:56:13.759643: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc1.bias, size=200, parts=1
[2019-12-01 06:56:13.759667: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc1.bias_0, key=786432, offset=0, len=200, device=1 rank=1
[2019-12-01 06:56:13.759691: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 1
[2019-12-01 06:56:13.759707: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc1.bias, rank=1
[2019-12-01 06:56:13.759739: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 1
[2019-12-01 06:56:13.759772: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc1.bias_0, key=786432; Passing to the next queue.
[2019-12-01 06:56:13.759789: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 1
[2019-12-01 06:56:13.759814: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.bias_0 send coordinate info: Signal=0, rank=1, key=786432
[2019-12-01 06:56:13.759824: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=786432, myrank=3
[2019-12-01 06:56:13.759929: T byteps/common/shared_memory.cc:43] initialized share memory size 200
[2019-12-01 06:56:13.759959: T byteps/common/operations.cc:264] byteps.Parameter.fc1.bias: open shared memory size 200
[2019-12-01 06:56:13.759976: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc1.bias, size=200, parts=1
[2019-12-01 06:56:13.760003: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc1.bias_0, key=786432, offset=0, len=200, device=2 rank=2
[2019-12-01 06:56:13.760028: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 2
[2019-12-01 06:56:13.760042: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc1.bias, rank=2
[2019-12-01 06:56:13.760202: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 2
[2019-12-01 06:56:13.760228: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc1.bias_0, key=786432; Passing to the next queue.
[2019-12-01 06:56:13.760243: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 2
[2019-12-01 06:56:13.760273: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.bias_0 send coordinate info: Signal=0, rank=2, key=786432
[2019-12-01 06:56:13.760283: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=786432, myrank=3
[2019-12-01 06:56:13.760323: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 3
[2019-12-01 06:56:13.760394: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=786432, myrank=0
[2019-12-01 06:56:13.760405: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.bias_0 calling NCCL REDUCE (rank=3) key=786432, elements=50, device=3
[2019-12-01 06:56:13.760406: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=786432, myrank=1
[2019-12-01 06:56:13.760411: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=786432, myrank=2
[2019-12-01 06:56:13.760422: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc1.bias_0 key: 786432 rank: 0
[2019-12-01 06:56:13.760443: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc1.bias_0 key: 786432 rank: 2
[2019-12-01 06:56:13.760444: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc1.bias_0 key: 786432 rank: 1
[2019-12-01 06:56:13.760448: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.bias_0 calling NCCL REDUCE (rank=0) key=786432, elements=50, device=0
[2019-12-01 06:56:13.760474: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.bias_0 calling NCCL REDUCE (rank=2) key=786432, elements=50, device=2
[2019-12-01 06:56:13.760476: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.bias_0 calling NCCL REDUCE (rank=1) key=786432, elements=50, device=1
[2019-12-01 06:56:13.760477: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.760472: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.760500: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.760503: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.760592: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.Parameter.fc1.bias_0, key=786432; Passing to the next queue.
[2019-12-01 06:56:13.760618: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 1
[2019-12-01 06:56:13.760628: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.Parameter.fc1.bias_0, key=786432; Passing to the next queue.
[2019-12-01 06:56:13.760637: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.760657: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 0
[2019-12-01 06:56:13.760670: T byteps/common/core_loops.cc:67] Rank=3 finishes REDUCE, tensor: byteps.Parameter.fc1.bias_0, key=786432; Passing to the next queue.
[2019-12-01 06:56:13.760681: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.760705: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 3
[2019-12-01 06:56:13.760702: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 1
[2019-12-01 06:56:13.760727: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.760727: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc1.bias_0, key=786432; Passing to the next queue.
[2019-12-01 06:56:13.760722: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.Parameter.fc1.bias_0, key=786432; Passing to the next queue.
[2019-12-01 06:56:13.760744: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 1
[2019-12-01 06:56:13.760747: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 2
[2019-12-01 06:56:13.760763: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.760773: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.bias_0 send coordinate info: Signal=2, rank=1, key=786432
[2019-12-01 06:56:13.760773: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 0
[2019-12-01 06:56:13.760789: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=786432, myrank=3
[2019-12-01 06:56:13.760796: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc1.bias_0, key=786432; Passing to the next queue.
[2019-12-01 06:56:13.760829: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 0
[2019-12-01 06:56:13.760856: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.bias_0 send coordinate info: Signal=2, rank=0, key=786432
[2019-12-01 06:56:13.760860: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 2
[2019-12-01 06:56:13.760868: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=786432, myrank=3
[2019-12-01 06:56:13.760886: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc1.bias_0, key=786432; Passing to the next queue.
[2019-12-01 06:56:13.760896: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 2
[2019-12-01 06:56:13.760928: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.bias_0 send coordinate info: Signal=2, rank=2, key=786432
[2019-12-01 06:56:13.760934: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=786432, myrank=3
[2019-12-01 06:56:13.761051: T byteps/common/scheduled_queue.cc:129] Queue BROADCAST getTask: byteps.Parameter.fc1.bias_0 key: 786432 rank: 3
[2019-12-01 06:56:13.761118: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=786432, myrank=0
[2019-12-01 06:56:13.761131: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=786432, myrank=1
[2019-12-01 06:56:13.761139: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=786432, myrank=2
[2019-12-01 06:56:13.761137: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.bias_0 calling NCCL BROADCAST (rank=3) key=786432, elements=50, device=3
[2019-12-01 06:56:13.761147: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc1.bias_0 key: 786432 rank: 1
[2019-12-01 06:56:13.761148: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc1.bias_0 key: 786432 rank: 0
[2019-12-01 06:56:13.761168: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc1.bias_0 key: 786432 rank: 2
[2019-12-01 06:56:13.761189: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.bias_0 calling NCCL BROADCAST (rank=1) key=786432, elements=50, device=1
[2019-12-01 06:56:13.761198: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.bias_0 calling NCCL BROADCAST (rank=0) key=786432, elements=50, device=0
[2019-12-01 06:56:13.761200: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.bias_0 calling NCCL BROADCAST (rank=2) key=786432, elements=50, device=2
[2019-12-01 06:56:13.761221: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.761222: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.761223: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.761225: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.761304: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Parameter.fc1.bias_0
[2019-12-01 06:56:13.761340: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.761378: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Parameter.fc1.bias_0
[2019-12-01 06:56:13.761398: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Parameter.fc1.bias_0
[2019-12-01 06:56:13.761416: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.761415: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.761425: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Parameter.fc1.bias_0
[2019-12-01 06:56:13.761519: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.761776: D byteps/common/operations.cc:229] byteps.Parameter.fc1.weight partitioned to 1 part(s), total_len=64000, key_range=[851968, 851968] rank=2
[2019-12-01 06:56:13.761848: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc1.weight, size=64000, parts=1
[2019-12-01 06:56:13.761967: D byteps/common/operations.cc:229] byteps.Parameter.fc1.weight partitioned to 1 part(s), total_len=64000, key_range=[851968, 851968] rank=0
[2019-12-01 06:56:13.761993: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc1.weight, size=64000, parts=1
[2019-12-01 06:56:13.762284: D byteps/common/operations.cc:229] byteps.Parameter.fc1.weight partitioned to 1 part(s), total_len=64000, key_range=[851968, 851968] rank=3
[2019-12-01 06:56:13.762325: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc1.weight, size=64000, parts=1
[2019-12-01 06:56:13.762343: T byteps/common/shared_memory.cc:43] initialized share memory size 64000
[2019-12-01 06:56:13.762371: T byteps/common/operations.cc:264] byteps.Parameter.fc1.weight: open shared memory size 64000
[2019-12-01 06:56:13.762390: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc1.weight, size=64000, parts=1
[2019-12-01 06:56:13.762418: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc1.weight_0, key=851968, offset=0, len=64000, device=2 rank=2
[2019-12-01 06:56:13.762441: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 2
[2019-12-01 06:56:13.762468: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc1.weight, rank=2
[2019-12-01 06:56:13.762600: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 2
[2019-12-01 06:56:13.762635: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc1.weight_0, key=851968; Passing to the next queue.
[2019-12-01 06:56:13.762633: D byteps/common/operations.cc:229] byteps.Parameter.fc1.weight partitioned to 1 part(s), total_len=64000, key_range=[851968, 851968] rank=1
[2019-12-01 06:56:13.762663: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc1.weight, size=64000, parts=1
[2019-12-01 06:56:13.762663: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 2
[2019-12-01 06:56:13.762675: T byteps/common/shared_memory.cc:43] initialized share memory size 64000
[2019-12-01 06:56:13.762704: T byteps/common/operations.cc:264] byteps.Parameter.fc1.weight: open shared memory size 64000
[2019-12-01 06:56:13.762728: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc1.weight, size=64000, parts=1
[2019-12-01 06:56:13.762729: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.weight_0 send coordinate info: Signal=0, rank=2, key=851968
[2019-12-01 06:56:13.762743: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=851968, myrank=3
[2019-12-01 06:56:13.762755: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc1.weight_0, key=851968, offset=0, len=64000, device=0 rank=0
[2019-12-01 06:56:13.762791: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 0
[2019-12-01 06:56:13.762809: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc1.weight, rank=0
[2019-12-01 06:56:13.762909: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 0
[2019-12-01 06:56:13.762937: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc1.weight_0, key=851968; Passing to the next queue.
[2019-12-01 06:56:13.762951: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 0
[2019-12-01 06:56:13.762981: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.weight_0 send coordinate info: Signal=0, rank=0, key=851968
[2019-12-01 06:56:13.762988: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=0, key=851968, myrank=3
[2019-12-01 06:56:13.763027: T byteps/common/shared_memory.cc:43] initialized share memory size 64000
[2019-12-01 06:56:13.763061: T byteps/common/operations.cc:264] byteps.Parameter.fc1.weight: open shared memory size 64000
[2019-12-01 06:56:13.763076: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc1.weight, size=64000, parts=1
[2019-12-01 06:56:13.763101: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc1.weight_0, key=851968, offset=0, len=64000, device=3 rank=3
[2019-12-01 06:56:13.763117: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 3
[2019-12-01 06:56:13.763125: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc1.weight, rank=3
[2019-12-01 06:56:13.763234: T byteps/common/shared_memory.cc:43] initialized share memory size 64000
[2019-12-01 06:56:13.763260: T byteps/common/operations.cc:264] byteps.Parameter.fc1.weight: open shared memory size 64000
[2019-12-01 06:56:13.763275: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc1.weight, size=64000, parts=1
[2019-12-01 06:56:13.763295: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc1.weight_0, key=851968, offset=0, len=64000, device=1 rank=1
[2019-12-01 06:56:13.763318: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 1
[2019-12-01 06:56:13.763332: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc1.weight, rank=1
[2019-12-01 06:56:13.763408: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 1
[2019-12-01 06:56:13.763446: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc1.weight_0, key=851968; Passing to the next queue.
[2019-12-01 06:56:13.763471: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 1
[2019-12-01 06:56:13.763522: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.weight_0 send coordinate info: Signal=0, rank=1, key=851968
[2019-12-01 06:56:13.763530: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=851968, myrank=3
[2019-12-01 06:56:13.763703: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 3
[2019-12-01 06:56:13.763757: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=851968, myrank=0
[2019-12-01 06:56:13.763769: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.weight_0 calling NCCL REDUCE (rank=3) key=851968, elements=16000, device=3
[2019-12-01 06:56:13.763774: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=851968, myrank=1
[2019-12-01 06:56:13.763776: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=851968, myrank=2
[2019-12-01 06:56:13.763784: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc1.weight_0 key: 851968 rank: 0
[2019-12-01 06:56:13.763801: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc1.weight_0 key: 851968 rank: 1
[2019-12-01 06:56:13.763802: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc1.weight_0 key: 851968 rank: 2
[2019-12-01 06:56:13.763807: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.weight_0 calling NCCL REDUCE (rank=0) key=851968, elements=16000, device=0
[2019-12-01 06:56:13.763823: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.weight_0 calling NCCL REDUCE (rank=1) key=851968, elements=16000, device=1
[2019-12-01 06:56:13.763824: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.763824: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.weight_0 calling NCCL REDUCE (rank=2) key=851968, elements=16000, device=2
[2019-12-01 06:56:13.763827: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.763842: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.763845: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.763978: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.Parameter.fc1.weight_0, key=851968; Passing to the next queue.
[2019-12-01 06:56:13.764008: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 2
[2019-12-01 06:56:13.764007: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.Parameter.fc1.weight_0, key=851968; Passing to the next queue.
[2019-12-01 06:56:13.764053: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 0
[2019-12-01 06:56:13.764056: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.764065: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.764066: T byteps/common/core_loops.cc:67] Rank=3 finishes REDUCE, tensor: byteps.Parameter.fc1.weight_0, key=851968; Passing to the next queue.
[2019-12-01 06:56:13.764078: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.Parameter.fc1.weight_0, key=851968; Passing to the next queue.
[2019-12-01 06:56:13.764108: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 3
[2019-12-01 06:56:13.764111: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 1
[2019-12-01 06:56:13.764112: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 2
[2019-12-01 06:56:13.764112: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 0
[2019-12-01 06:56:13.764132: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.764134: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc1.weight_0, key=851968; Passing to the next queue.
[2019-12-01 06:56:13.764134: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.764139: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc1.weight_0, key=851968; Passing to the next queue.
[2019-12-01 06:56:13.764147: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 2
[2019-12-01 06:56:13.764155: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 0
[2019-12-01 06:56:13.764177: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.weight_0 send coordinate info: Signal=2, rank=2, key=851968
[2019-12-01 06:56:13.764192: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 1
[2019-12-01 06:56:13.764190: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=851968, myrank=3
[2019-12-01 06:56:13.764196: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.weight_0 send coordinate info: Signal=2, rank=0, key=851968
[2019-12-01 06:56:13.764215: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc1.weight_0, key=851968; Passing to the next queue.
[2019-12-01 06:56:13.764219: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=851968, myrank=3
[2019-12-01 06:56:13.764232: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 1
[2019-12-01 06:56:13.764271: T byteps/common/core_loops.cc:126] byteps.Parameter.fc1.weight_0 send coordinate info: Signal=2, rank=1, key=851968
[2019-12-01 06:56:13.764288: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=851968, myrank=3
[2019-12-01 06:56:13.764372: T byteps/common/scheduled_queue.cc:129] Queue BROADCAST getTask: byteps.Parameter.fc1.weight_0 key: 851968 rank: 3
[2019-12-01 06:56:13.764426: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=851968, myrank=0
[2019-12-01 06:56:13.764434: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.weight_0 calling NCCL BROADCAST (rank=3) key=851968, elements=16000, device=3
[2019-12-01 06:56:13.764435: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=851968, myrank=1
[2019-12-01 06:56:13.764445: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc1.weight_0 key: 851968 rank: 0
[2019-12-01 06:56:13.764442: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=851968, myrank=2
[2019-12-01 06:56:13.764453: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc1.weight_0 key: 851968 rank: 1
[2019-12-01 06:56:13.764458: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc1.weight_0 key: 851968 rank: 2
[2019-12-01 06:56:13.764460: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.weight_0 calling NCCL BROADCAST (rank=0) key=851968, elements=16000, device=0
[2019-12-01 06:56:13.764478: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.weight_0 calling NCCL BROADCAST (rank=2) key=851968, elements=16000, device=2
[2019-12-01 06:56:13.764481: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.764482: T byteps/common/core_loops.cc:165] byteps.Parameter.fc1.weight_0 calling NCCL BROADCAST (rank=1) key=851968, elements=16000, device=1
[2019-12-01 06:56:13.764483: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.764506: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.764510: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.764647: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Parameter.fc1.weight_0
[2019-12-01 06:56:13.764650: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Parameter.fc1.weight_0
[2019-12-01 06:56:13.764671: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Parameter.fc1.weight_0
[2019-12-01 06:56:13.764676: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.764677: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.764679: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Parameter.fc1.weight_0
[2019-12-01 06:56:13.764698: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.764707: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.764975: D byteps/common/operations.cc:229] byteps.Parameter.fc2.bias partitioned to 1 part(s), total_len=40, key_range=[917504, 917504] rank=1
[2019-12-01 06:56:13.764997: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc2.bias, size=40, parts=1
[2019-12-01 06:56:13.765161: D byteps/common/operations.cc:229] byteps.Parameter.fc2.bias partitioned to 1 part(s), total_len=40, key_range=[917504, 917504] rank=2
[2019-12-01 06:56:13.765188: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc2.bias, size=40, parts=1
[2019-12-01 06:56:13.765379: T byteps/common/shared_memory.cc:43] initialized share memory size 40
[2019-12-01 06:56:13.765404: T byteps/common/operations.cc:264] byteps.Parameter.fc2.bias: open shared memory size 40
[2019-12-01 06:56:13.765420: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc2.bias, size=40, parts=1
[2019-12-01 06:56:13.765421: D byteps/common/operations.cc:229] byteps.Parameter.fc2.bias partitioned to 1 part(s), total_len=40, key_range=[917504, 917504] rank=0
[2019-12-01 06:56:13.765443: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc2.bias_0, key=917504, offset=0, len=40, device=1 rank=1
[2019-12-01 06:56:13.765454: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc2.bias, size=40, parts=1
[2019-12-01 06:56:13.765466: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 1
[2019-12-01 06:56:13.765489: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc2.bias, rank=1
[2019-12-01 06:56:13.765598: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 1
[2019-12-01 06:56:13.765635: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc2.bias_0, key=917504; Passing to the next queue.
[2019-12-01 06:56:13.765651: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 1
[2019-12-01 06:56:13.765684: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.bias_0 send coordinate info: Signal=0, rank=1, key=917504
[2019-12-01 06:56:13.765694: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=917504, myrank=3
[2019-12-01 06:56:13.765835: D byteps/common/operations.cc:229] byteps.Parameter.fc2.bias partitioned to 1 part(s), total_len=40, key_range=[917504, 917504] rank=3
[2019-12-01 06:56:13.765868: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc2.bias, size=40, parts=1
[2019-12-01 06:56:13.765901: T byteps/common/shared_memory.cc:43] initialized share memory size 40
[2019-12-01 06:56:13.765929: T byteps/common/operations.cc:264] byteps.Parameter.fc2.bias: open shared memory size 40
[2019-12-01 06:56:13.765946: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc2.bias, size=40, parts=1
[2019-12-01 06:56:13.765970: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc2.bias_0, key=917504, offset=0, len=40, device=2 rank=2
[2019-12-01 06:56:13.765994: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 2
[2019-12-01 06:56:13.766011: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc2.bias, rank=2
[2019-12-01 06:56:13.766136: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 2
[2019-12-01 06:56:13.766163: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc2.bias_0, key=917504; Passing to the next queue.
[2019-12-01 06:56:13.766179: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 2
[2019-12-01 06:56:13.766212: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.bias_0 send coordinate info: Signal=0, rank=2, key=917504
[2019-12-01 06:56:13.766224: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=917504, myrank=3
[2019-12-01 06:56:13.766261: T byteps/common/shared_memory.cc:43] initialized share memory size 40
[2019-12-01 06:56:13.766298: T byteps/common/operations.cc:264] byteps.Parameter.fc2.bias: open shared memory size 40
[2019-12-01 06:56:13.766311: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc2.bias, size=40, parts=1
[2019-12-01 06:56:13.766336: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc2.bias_0, key=917504, offset=0, len=40, device=0 rank=0
[2019-12-01 06:56:13.766358: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 0
[2019-12-01 06:56:13.766378: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc2.bias, rank=0
[2019-12-01 06:56:13.766397: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 0
[2019-12-01 06:56:13.766438: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc2.bias_0, key=917504; Passing to the next queue.
[2019-12-01 06:56:13.766468: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 0
[2019-12-01 06:56:13.766481: T byteps/common/shared_memory.cc:43] initialized share memory size 40
[2019-12-01 06:56:13.766496: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.bias_0 send coordinate info: Signal=0, rank=0, key=917504
[2019-12-01 06:56:13.766530: T byteps/common/operations.cc[:2019-12-01 06:56:13.766506: T byteps/common/communicator.cc:199] 264root socket recved: src=0, signal=0, key=917504, myrank=3
] byteps.Parameter.fc2.bias: open shared memory size 40
[2019-12-01 06:56:13.766593: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc2.bias, size=40, parts=1
[2019-12-01 06:56:13.766624: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc2.bias_0, key=917504, offset=0, len=40, device=3 rank=3
[2019-12-01 06:56:13.766648: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 3
[2019-12-01 06:56:13.766665: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc2.bias, rank=3
[2019-12-01 06:56:13.766746: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 3
[2019-12-01 06:56:13.766822: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=917504, myrank=1
[2019-12-01 06:56:13.766823: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=917504, myrank=0
[2019-12-01 06:56:13.766829: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.bias_0 calling NCCL REDUCE (rank=3) key=917504, elements=10, device=3
[2019-12-01 06:56:13.766834: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=917504, myrank=2
[2019-12-01 06:56:13.766848: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc2.bias_0 key: 917504 rank: 1
[2019-12-01 06:56:13.766849: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc2.bias_0 key: 917504 rank: 0
[2019-12-01 06:56:13.766855: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc2.bias_0 key: 917504 rank: 2
[2019-12-01 06:56:13.766870: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.bias_0 calling NCCL REDUCE (rank=1) key=917504, elements=10, device=1
[2019-12-01 06:56:13.766872: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.bias_0 calling NCCL REDUCE (rank=0) key=917504, elements=10, device=0
[2019-12-01 06:56:13.766887: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.bias_0 calling NCCL REDUCE (rank=2) key=917504, elements=10, device=2
[2019-12-01 06:56:13.766894: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.766905: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.766900: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.766902: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.767018: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.Parameter.fc2.bias_0, key=917504; Passing to the next queue.
[2019-12-01 06:56:13.767049: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 2
[2019-12-01 06:56:13.767052: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.Parameter.fc2.bias_0, key=917504; Passing to the next queue.
[2019-12-01 06:56:13.767076: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 1
[2019-12-01 06:56:13.767089: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.767078: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.767089: T byteps/common/core_loops.cc:67] Rank=3 finishes REDUCE, tensor: byteps.Parameter.fc2.bias_0, key=917504; Passing to the next queue.
[2019-12-01 06:56:13.767092: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.Parameter.fc2.bias_0, key=917504; Passing to the next queue.
[2019-12-01 06:56:13.767131: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 3
[2019-12-01 06:56:13.767133: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 0
[2019-12-01 06:56:13.767082: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 2
[2019-12-01 06:56:13.767150: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.767154: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc2.bias_0, key=917504; Passing to the next queue.
[2019-12-01 06:56:13.767167: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 2
[2019-12-01 06:56:13.767169: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 1
[2019-12-01 06:56:13.767165: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.767171: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 0
[2019-12-01 06:56:13.767197: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc2.bias_0, key=917504; Passing to the next queue.
[2019-12-01 06:56:13.767213: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.bias_0 send coordinate info: Signal=2, rank=2, key=917504
[2019-12-01 06:56:13.767222: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 1
[2019-12-01 06:56:13.767223: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc2.bias_0, key=917504; Passing to the next queue.
[2019-12-01 06:56:13.767227: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=917504, myrank=3
[2019-12-01 06:56:13.767241: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 0
[2019-12-01 06:56:13.767259: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.bias_0 send coordinate info: Signal=2, rank=1, key=917504
[2019-12-01 06:56:13.767268: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=917504, myrank=3
[2019-12-01 06:56:13.767299: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.bias_0 send coordinate info: Signal=2, rank=0, key=917504
[2019-12-01 06:56:13.767307: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=917504, myrank=3
[2019-12-01 06:56:13.767400: T byteps/common/scheduled_queue.cc:129] Queue BROADCAST getTask: byteps.Parameter.fc2.bias_0 key: 917504 rank: 3
[2019-12-01 06:56:13.767475: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=917504, myrank=0
[2019-12-01 06:56:13.767489: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.bias_0 calling NCCL BROADCAST (rank=3) key=917504, elements=10, device=3
[2019-12-01 06:56:13.767488: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=917504, myrank=1
[2019-12-01 06:56:13.767499: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc2.bias_0 key: 917504 rank: 0
[2019-12-01 06:56:13.767494: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=917504, myrank=2
[2019-12-01 06:56:13.767518: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc2.bias_0 key: 917504 rank: 1
[2019-12-01 06:56:13.767527: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc2.bias_0 key: 917504 rank: 2
[2019-12-01 06:56:13.767527: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.bias_0 calling NCCL BROADCAST (rank=0) key=917504, elements=10, device=0
[2019-12-01 06:56:13.767534: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.bias_0 calling NCCL BROADCAST (rank=1) key=917504, elements=10, device=1
[2019-12-01 06:56:13.767549: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.767549: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.bias_0 calling NCCL BROADCAST (rank=2) key=917504, elements=10, device=2
[2019-12-01 06:56:13.767552: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.767554: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.767573: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.767650: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Parameter.fc2.bias_0
[2019-12-01 06:56:13.767672: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.767677: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Parameter.fc2.bias_0
[2019-12-01 06:56:13.767683: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Parameter.fc2.bias_0
[2019-12-01 06:56:13.767724: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.767740: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.767757: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Parameter.fc2.bias_0
[2019-12-01 06:56:13.767780: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.768142: D byteps/common/operations.cc:229] byteps.Parameter.fc2.weight partitioned to 1 part(s), total_len=2000, key_range=[983040, 983040] rank=3
[2019-12-01 06:56:13.768168: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc2.weight, size=2000, parts=1
[2019-12-01 06:56:13.768469: D byteps/common/operations.cc:229] byteps.Parameter.fc2.weight partitioned to 1 part(s), total_len=2000, key_range=[983040, 983040] rank=1
[2019-12-01 06:56:13.768495: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc2.weight, size=2000, parts=1
[2019-12-01 06:56:13.768528: T byteps/common/shared_memory.cc:43] initialized share memory size 2000
[2019-12-01 06:56:13.768561: T byteps/common/operations.cc:264] byteps.Parameter.fc2.weight: open shared memory size 2000
[2019-12-01 06:56:13.768561: D byteps/common/operations.cc:229] byteps.Parameter.fc2.weight partitioned to 1 part(s), total_len=2000, key_range=[983040, 983040] rank=2
[2019-12-01 06:56:13.768585: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc2.weight, size=2000, parts=1
[2019-12-01 06:56:13.768586: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc2.weight, size=2000, parts=1
[2019-12-01 06:56:13.768619: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc2.weight_0, key=983040, offset=0, len=2000, device=3 rank=3
[2019-12-01 06:56:13.768645: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 3
[2019-12-01 06:56:13.768654: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc2.weight, rank=3
[2019-12-01 06:56:13.768875: D byteps/common/operations.cc:229] byteps.Parameter.fc2.weight partitioned to 1 part(s), total_len=2000, key_range=[983040, 983040] rank=0
[2019-12-01 06:56:13.768903: T byteps/common/operations.cc:243] Begin init byteps.Parameter.fc2.weight, size=2000, parts=1
[2019-12-01 06:56:13.769000: T byteps/common/shared_memory.cc:43] initialized share memory size 2000
[2019-12-01 06:56:13.769029: T byteps/common/operations.cc:264] byteps.Parameter.fc2.weight: open shared memory size 2000
[2019-12-01 06:56:13.769047: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc2.weight, size=2000, parts=1
[2019-12-01 06:56:13.769064: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc2.weight_0, key=983040, offset=0, len=2000, device=1 rank=1
[2019-12-01 06:56:13.769086: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 1
[2019-12-01 06:56:13.769103: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc2.weight, rank=1
[2019-12-01 06:56:13.769231: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 1
[2019-12-01 06:56:13.769261: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc2.weight_0, key=983040; Passing to the next queue.
[2019-12-01 06:56:13.769277: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 1
[2019-12-01 06:56:13.769329: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.weight_0 send coordinate info: Signal=0, rank=1, key=983040
[2019-12-01 06:56:13.769345: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=983040, myrank=3
[2019-12-01 06:56:13.769365: T byteps/common/shared_memory.cc:43] initialized share memory size 2000
[2019-12-01 06:56:13.769392: T byteps/common/operations.cc:264] byteps.Parameter.fc2.weight: open shared memory size 2000
[2019-12-01 06:56:13.769404: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc2.weight, size=2000, parts=1
[2019-12-01 06:56:13.769420: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc2.weight_0, key=983040, offset=0, len=2000, device=2 rank=2
[2019-12-01 06:56:13.769443: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 2
[2019-12-01 06:56:13.769467: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc2.weight, rank=2
[2019-12-01 06:56:13.769482: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 2
[2019-12-01 06:56:13.769535: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc2.weight_0, key=983040; Passing to the next queue.
[2019-12-01 06:56:13.769566: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 2
[2019-12-01 06:56:13.769599: T byteps/common/shared_memory.cc:43] initialized share memory size 2000
[2019-12-01 06:56:13.769609: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.weight_0 send coordinate info: Signal=0, rank=2, key=983040
[2019-12-01 06:56:13.769618: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=983040, myrank=3
[2019-12-01 06:56:13.769627: T byteps/common/operations.cc:264] byteps.Parameter.fc2.weight: open shared memory size 2000
[2019-12-01 06:56:13.769638: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.fc2.weight, size=2000, parts=1
[2019-12-01 06:56:13.769655: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.fc2.weight_0, key=983040, offset=0, len=2000, device=0 rank=0
[2019-12-01 06:56:13.769686: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 0
[2019-12-01 06:56:13.769708: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.fc2.weight, rank=0
[2019-12-01 06:56:13.769855: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 0
[2019-12-01 06:56:13.769887: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.fc2.weight_0, key=983040; Passing to the next queue.
[2019-12-01 06:56:13.769897: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 0
[2019-12-01 06:56:13.769928: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.weight_0 send coordinate info: Signal=0, rank=0, key=983040
[2019-12-01 06:56:13.769941: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=0, key=983040, myrank=3
[2019-12-01 06:56:13.770061: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 3
[2019-12-01 06:56:13.770117: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=983040, myrank=0
[2019-12-01 06:56:13.770127: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.weight_0 calling NCCL REDUCE (rank=3) key=983040, elements=500, device=3
[2019-12-01 06:56:13.770128: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=983040, myrank=1
[2019-12-01 06:56:13.770134: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=983040, myrank=2
[2019-12-01 06:56:13.770150: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc2.weight_0 key: 983040 rank: 1
[2019-12-01 06:56:13.770168: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc2.weight_0 key: 983040 rank: 2
[2019-12-01 06:56:13.770149: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.fc2.weight_0 key: 983040 rank: 0
[2019-12-01 06:56:13.770184: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.weight_0 calling NCCL REDUCE (rank=1) key=983040, elements=500, device=1
[2019-12-01 06:56:13.770185: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.weight_0 calling NCCL REDUCE (rank=2) key=983040, elements=500, device=2
[2019-12-01 06:56:13.770188: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.weight_0 calling NCCL REDUCE (rank=0) key=983040, elements=500, device=0
[2019-12-01 06:56:13.770185: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.770208: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.770209: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.770211: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.770288: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.Parameter.fc2.weight_0, key=983040; Passing to the next queue.
[2019-12-01 06:56:13.770308: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 1
[2019-12-01 06:56:13.770321: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.770364: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.Parameter.fc2.weight_0, key=983040; Passing to the next queue.
[2019-12-01 06:56:13.770381: T byteps/common/core_loops.cc:67] Rank=3 finishes REDUCE, tensor: byteps.Parameter.fc2.weight_0, key=983040; Passing to the next queue.
[2019-12-01 06:56:13.770391: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 1
[2019-12-01 06:56:13.770401: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 2
[2019-12-01 06:56:13.770403: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 3
[2019-12-01 06:56:13.770423: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc2.weight_0, key=983040; Passing to the next queue.
[2019-12-01 06:56:13.770427: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.770419: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.Parameter.fc2.weight_0, key=983040; Passing to the next queue.
[2019-12-01 06:56:13.770440: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 1
[2019-12-01 06:56:13.770427: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.770445: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 0
[2019-12-01 06:56:13.770431: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 2
[2019-12-01 06:56:13.770479: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.770479: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.weight_0 send coordinate info: Signal=2, rank=1, key=983040
[2019-12-01 06:56:13.770482: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc2.weight_0, key=983040; Passing to the next queue.
[2019-12-01 06:56:13.770492: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=983040, myrank=3
[2019-12-01 06:56:13.770498: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 2
[2019-12-01 06:56:13.770564: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.weight_0 send coordinate info: Signal=2, rank=2, key=983040
[2019-12-01 06:56:13.770562: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 0
[2019-12-01 06:56:13.770585: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.fc2.weight_0, key=983040; Passing to the next queue.
[2019-12-01 06:56:13.770580: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=983040, myrank=3
[2019-12-01 06:56:13.770602: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 0
[2019-12-01 06:56:13.770645: T byteps/common/core_loops.cc:126] byteps.Parameter.fc2.weight_0 send coordinate info: Signal=2, rank=0, key=983040
[2019-12-01 06:56:13.770652: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=983040, myrank=3
[2019-12-01 06:56:13.770725: T byteps/common/scheduled_queue.cc:129] Queue BROADCAST getTask: byteps.Parameter.fc2.weight_0 key: 983040 rank: 3
[2019-12-01 06:56:13.770775: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=983040, myrank=0
[2019-12-01 06:56:13.770785: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.weight_0 calling NCCL BROADCAST (rank=3) key=983040, elements=500, device=3
[2019-12-01 06:56:13.770785: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=983040, myrank=1
[2019-12-01 06:56:13.770793: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=5, key=983040, myrank=2
[2019-12-01 06:56:13.770801: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc2.weight_0 key: 983040 rank: 0
[2019-12-01 06:56:13.770804: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc2.weight_0 key: 983040 rank: 1
[2019-12-01 06:56:13.770810: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Parameter.fc2.weight_0 key: 983040 rank: 2
[2019-12-01 06:56:13.770818: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.weight_0 calling NCCL BROADCAST (rank=1) key=983040, elements=500, device=1
[2019-12-01 06:56:13.770816: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.weight_0 calling NCCL BROADCAST (rank=0) key=983040, elements=500, device=0
[2019-12-01 06:56:13.770825: T byteps/common/core_loops.cc:165] byteps.Parameter.fc2.weight_0 calling NCCL BROADCAST (rank=2) key=983040, elements=500, device=2
[2019-12-01 06:56:13.770832: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.770835: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.770837: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.770853: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.770953: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Parameter.fc2.weight_0
[2019-12-01 06:56:13.770984: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Parameter.fc2.weight_0
[2019-12-01 06:56:13.770996: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 06:56:13.771017: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.771018: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Parameter.fc2.weight_0
[2019-12-01 06:56:13.771032: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Parameter.fc2.weight_0
[2019-12-01 06:56:13.771053: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.771056: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.773168: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.dampening.0, declared key (not PS key): 16 rank=0
[2019-12-01 06:56:13.773275: D byteps/common/operations.cc:229] byteps.Parameter.dampening.0 partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=0
[2019-12-01 06:56:13.773302: T byteps/common/operations.cc:243] Begin init byteps.Parameter.dampening.0, size=4, parts=1
[2019-12-01 06:56:13.773342: D byteps/common/operations.cc:249] byteps.Parameter.dampening.0 is already on cpu, len=4
[2019-12-01 06:56:13.773357: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.dampening.0, declared key (not PS key): 16 rank=3
[2019-12-01 06:56:13.773455: D byteps/common/operations.cc:229] byteps.Parameter.dampening.0 partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=3
[2019-12-01 06:56:13.773490: T byteps/common/operations.cc:243] Begin init byteps.Parameter.dampening.0, size=4, parts=1
[2019-12-01 06:56:13.773507: D byteps/common/operations.cc:249] byteps.Parameter.dampening.0 is already on cpu, len=4
[2019-12-01 06:56:13.773827: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.dampening.0, declared key (not PS key): 16 rank=2
[2019-12-01 06:56:13.773843: D byteps/common/global.cc:320] Declared tensor byteps.Parameter.dampening.0, declared key (not PS key): 16 rank=1
[2019-12-01 06:56:13.773932: D byteps/common/operations.cc:229] byteps.Parameter.dampening.0 partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=2
[2019-12-01 06:56:13.773946: D byteps/common/operations.cc:229] byteps.Parameter.dampening.0 partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=1
[2019-12-01 06:56:13.773958: T byteps/common/operations.cc:243] Begin init byteps.Parameter.dampening.0, size=4, parts=1
[2019-12-01 06:56:13.773973: D byteps/common/operations.cc:249] byteps.Parameter.dampening.0 is already on cpu, len=4
[2019-12-01 06:56:13.773974: T byteps/common/operations.cc:243] Begin init byteps.Parameter.dampening.0, size=4, parts=1
[2019-12-01 06:56:13.773988: D byteps/common/operations.cc:249] byteps.Parameter.dampening.0 is already on cpu, len=4
[2019-12-01 06:56:13.774463: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 06:56:13.774490: T byteps/common/operations.cc:264] byteps.Parameter.dampening.0: open shared memory size 4
[2019-12-01 06:56:13.774503: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.dampening.0, size=4, parts=1
[2019-12-01 06:56:13.774524: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.dampening.0_0, key=1048576, offset=0, len=4, device=-1 rank=0
[2019-12-01 06:56:13.774541: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 0
[2019-12-01 06:56:13.774575: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.dampening.0, rank=0
[2019-12-01 06:56:13.774605: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 0
[2019-12-01 06:56:13.774642: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.dampening.0_0, key=1048576; Passing to the next queue.
[2019-12-01 06:56:13.774663: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 0
[2019-12-01 06:56:13.774695: T byteps/common/core_loops.cc:126] byteps.Parameter.dampening.0_0 send coordinate info: Signal=0, rank=0, key=1048576
[2019-12-01 06:56:13.774709: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=0, key=1048576, myrank=3
[2019-12-01 06:56:13.775207: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 06:56:13.775242: T byteps/common/operations.cc:264] byteps.Parameter.dampening.0: open shared memory size 4
[2019-12-01 06:56:13.775258: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.dampening.0, size=4, parts=1
[2019-12-01 06:56:13.775282: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.dampening.0_0, key=1048576, offset=0, len=4, device=-1 rank=3
[2019-12-01 06:56:13.775297: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 3
[2019-12-01 06:56:13.775313: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.dampening.0, rank=3
[2019-12-01 06:56:13.775656: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 06:56:13.775681: T byteps/common/operations.cc:264] byteps.Parameter.dampening.0: open shared memory size 4
[2019-12-01 06:56:13.775696: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.dampening.0, size=4, parts=1
[2019-12-01 06:56:13.775718: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.dampening.0_0, key=1048576, offset=0, len=4, device=-1 rank=1
[2019-12-01 06:56:13.775733: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 1
[2019-12-01 06:56:13.775751: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.dampening.0, rank=1
[2019-12-01 06:56:13.775759: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 06:56:13.775780: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 1
[2019-12-01 06:56:13.775791: T byteps/common/operations.cc:264] byteps.Parameter.dampening.0: open shared memory size 4
[2019-12-01 06:56:13.775809: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.dampening.0_0, key=1048576; Passing to the next queue.
[2019-12-01 06:56:13.775814: T byteps/common/operations.cc:295] Finish Init byteps.Parameter.dampening.0, size=4, parts=1
[2019-12-01 06:56:13.775825: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 1
[2019-12-01 06:56:13.775834: T byteps/common/operations.cc:185] EnqueueTensor: byteps.Parameter.dampening.0_0, key=1048576, offset=0, len=4, device=-1 rank=2
[2019-12-01 06:56:13.775856: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 2
[2019-12-01 06:56:13.775858: T byteps/common/core_loops.cc:126] byteps.Parameter.dampening.0_0 send coordinate info: Signal=0, rank=1, key=1048576
[2019-12-01 06:56:13.775871: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.Parameter.dampening.0, rank=2
[2019-12-01 06:56:13.775867: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=1048576, myrank=3
[2019-12-01 06:56:13.775930: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 2
[2019-12-01 06:56:13.775957: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.Parameter.dampening.0_0, key=1048576; Passing to the next queue.
[2019-12-01 06:56:13.775971: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 2
[2019-12-01 06:56:13.776017: T byteps/common/core_loops.cc:126] byteps.Parameter.dampening.0_0 send coordinate info: Signal=0, rank=2, key=1048576
[2019-12-01 06:56:13.776025: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=1048576, myrank=3
[2019-12-01 06:56:13.776092: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 3
[2019-12-01 06:56:13.776159: T byteps/common/core_loops.cc:165] byteps.Parameter.dampening.0_0 calling NCCL REDUCE (rank=3) key=1048576, elements=1, device=-1
[2019-12-01 06:56:13.776159: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=1048576, myrank=1
[2019-12-01 06:56:13.776153: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=1048576, myrank=0
[2019-12-01 06:56:13.776169: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=1048576, myrank=2
[2019-12-01 06:56:13.776182: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.dampening.0_0 key: 1048576 rank: 0
[2019-12-01 06:56:13.776189: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.dampening.0_0 key: 1048576 rank: 2
[2019-12-01 06:56:13.776198: T byteps/common/core_loops.cc:165] byteps.Parameter.dampening.0_0 calling NCCL REDUCE (rank=0) key=1048576, elements=1, device=-1
[2019-12-01 06:56:13.776178: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.Parameter.dampening.0_0 key: 1048576 rank: 1
[2019-12-01 06:56:13.776207: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 06:56:13.776213: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 06:56:13.776214: T byteps/common/core_loops.cc:165] byteps.Parameter.dampening.0_0 calling NCCL REDUCE (rank=2) key=1048576, elements=1, device=-1
[2019-12-01 06:56:13.776218: T byteps/common/core_loops.cc:165] byteps.Parameter.dampening.0_0 calling NCCL REDUCE (rank=1) key=1048576, elements=1, device=-1
[2019-12-01 06:56:13.776238: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 06:56:13.776240: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 06:56:13.776316: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.Parameter.dampening.0_0, key=1048576; Passing to the next queue.
[2019-12-01 06:56:13.776323: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.Parameter.dampening.0_0, key=1048576; Passing to the next queue.
[2019-12-01 06:56:13.776354: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 0
[2019-12-01 06:56:13.776356: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 1
[2019-12-01 06:56:13.776371: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 06:56:13.776374: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 06:56:13.776393: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 0
[2019-12-01 06:56:13.776420: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 1
[2019-12-01 06:56:13.776431: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.dampening.0_0, key=1048576; Passing to the next queue.
[2019-12-01 06:56:13.776444: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.Parameter.dampening.0_0, key=1048576; Passing to the next queue.
[2019-12-01 06:56:13.776453: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.dampening.0_0, key=1048576; Passing to the next queue.
[2019-12-01 06:56:13.776466: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 1
[2019-12-01 06:56:13.776465: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 2
[2019-12-01 06:56:13.776464: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 0
[2019-12-01 06:56:13.776498: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 06:56:13.776504: T byteps/common/core_loops.cc:126] byteps.Parameter.dampening.0_0 send coordinate info: Signal=2, rank=1, key=1048576
[2019-12-01 06:56:13.776504: T byteps/common/core_loops.cc:126] byteps.Parameter.dampening.0_0 send coordinate info: Signal=2, rank=0, key=1048576
[2019-12-01 06:56:13.776507: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 2
[2019-12-01 06:56:13.776519: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=1048576, myrank=3
[2019-12-01 06:56:13.776535: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.Parameter.dampening.0_0, key=1048576; Passing to the next queue.
[2019-12-01 06:56:13.776537: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=1048576, myrank=3
[2019-12-01 06:56:13.776555: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.Parameter.dampening.0_0 key: 1048576 rank: 2
[2019-12-01 06:56:13.776592: T byteps/common/core_loops.cc:126] byteps.Parameter.dampening.0_0 send coordinate info: Signal=2, rank=2, key=1048576
[2019-12-01 06:56:13.776603: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=1048576, myrank=3
[2019-12-01 06:56:13.926863: F byteps/common/nccl_manager.cc:37] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20: 402 Aborted (core dumped) python $path/train_mnist_byteps.py $@
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/byteps/launcher/launch.py", line 41, in worker
subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh' returned non-zero exit status 134
@nowei Thank you. You are right. INFO does not give anything new. The useful level is DEBUG. However, TRACE would include anything that DEBUG outputs, so what you have is good enough for us. We'll look into this.
@nowei If you repeat multiple times with TRACE logs, does it always die on the key 1048576
? From the logs you paste, you can see that the last few lines are all about key 1048576
.
I ran it a few more times and they all died on key 1048576.
Thanks. This is very helpful. So, it's a deterministic bug. There has to be something special about this tensor byteps.Parameter.dampening.0_0
@nowei Would you do one more favor? Comment out this line and try again. https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L109
It ended up training for one epoch and then it crashed again. It ended with something like this:
Train Epoch: 1 [14720/15000 (98%)] Loss: 0.529198
Train Epoch: 1 [14720/15000 (98%)] Loss: 0.421403
Train Epoch: 1 [14720/15000 (98%)] Loss: 0.339793
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
[2019-12-01 09:17:47.531626: F byteps/common/nccl_manager.cc:37] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20: 32 Aborted (core dumped) python $path/train_mnist_byteps.py $@
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/byteps/launcher/launch.py", line 41, in worker
subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh' returned non-zero exit status 134
Here's the last few lines from the debug info. It seems like it died again on key 1048576.
[2019-12-01 09:20:21.889601: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 09:20:21.889606: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 09:20:21.889690: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Gradient.conv1.weight_0
[2019-12-01 09:20:21.889756: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Gradient.conv1.weight_0
[2019-12-01 09:20:21.889757: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Gradient.conv1.weight_0
[2019-12-01 09:20:21.889757: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Gradient.conv1.bias_0
[2019-12-01 09:20:21.889762: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Gradient.conv1.weight_0
[2019-12-01 09:20:21.889816: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Gradient.conv1.bias_0
[2019-12-01 09:20:21.889818: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Gradient.conv1.bias_0
[2019-12-01 09:20:21.889819: T byteps/common/core_loops.cc:308] Finished NCCL Group size=2 rank=2
[2019-12-01 09:20:21.889819: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Gradient.conv1.bias_0
[2019-12-01 09:20:21.889868: T byteps/common/core_loops.cc:308] Finished NCCL Group size=2 rank=3
[2019-12-01 09:20:21.889864: T byteps/common/core_loops.cc:308] Finished NCCL Group size=2 rank=1
[2019-12-01 09:20:21.889880: T byteps/common/core_loops.cc:308] Finished NCCL Group size=2 rank=0
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
[2019-12-01 09:20:22.415743: D byteps/common/global.cc:320] Declared tensor byteps.avg_loss, declared key (not PS key): 16 rank=0
[2019-12-01 09:20:22.415922: D byteps/common/operations.cc:229] byteps.avg_loss partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=0
[2019-12-01 09:20:22.415955: T byteps/common/operations.cc:243] Begin init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:20:22.415975: D byteps/common/operations.cc:249] byteps.avg_loss is already on cpu, len=4
[2019-12-01 09:20:22.416885: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 09:20:22.416914: T byteps/common/operations.cc:264] byteps.avg_loss: open shared memory size 4
[2019-12-01 09:20:22.416939: T byteps/common/operations.cc:295] Finish Init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:20:22.416965: T byteps/common/operations.cc:185] EnqueueTensor: byteps.avg_loss_0, key=1048576, offset=0, len=4, device=-1 rank=0
[2019-12-01 09:20:22.416984: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:20:22.416993: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.avg_loss, rank=0
[2019-12-01 09:20:22.417095: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:20:22.417130: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:20:22.417151: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:20:22.417189: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=0, rank=0, key=1048576
[2019-12-01 09:20:22.417352: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=0, key=1048576, myrank=3
[2019-12-01 09:20:22.417706: D byteps/common/global.cc:320] Declared tensor byteps.avg_loss, declared key (not PS key): 16 rank=3
[2019-12-01 09:20:22.417844: D byteps/common/operations.cc:229] byteps.avg_loss partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=3
[2019-12-01 09:20:22.417889: T byteps/common/operations.cc:243] Begin init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:20:22.417906: D byteps/common/operations.cc:249] byteps.avg_loss is already on cpu, len=4
[2019-12-01 09:20:22.418722: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 09:20:22.418754: T byteps/common/operations.cc:264] byteps.avg_loss: open shared memory size 4
[2019-12-01 09:20:22.418772: T byteps/common/operations.cc:295] Finish Init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:20:22.418797: T byteps/common/operations.cc:185] EnqueueTensor: byteps.avg_loss_0, key=1048576, offset=0, len=4, device=-1 rank=3
[2019-12-01 09:20:22.418823: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 3
[2019-12-01 09:20:22.418852: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.avg_loss, rank=3
[2019-12-01 09:20:22.419510: D byteps/common/global.cc:320] Declared tensor byteps.avg_loss, declared key (not PS key): 16 rank=2
[2019-12-01 09:20:22.419631: D byteps/common/operations.cc:229] byteps.avg_loss partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=2
[2019-12-01 09:20:22.419679: T byteps/common/operations.cc:243] Begin init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:20:22.419703: D byteps/common/operations.cc:249] byteps.avg_loss is already on cpu, len=4
[2019-12-01 09:20:22.420484: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 09:20:22.420512: T byteps/common/operations.cc:264] byteps.avg_loss: open shared memory size 4
[2019-12-01 09:20:22.420527: T byteps/common/operations.cc:295] Finish Init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:20:22.420550: T byteps/common/operations.cc:185] EnqueueTensor: byteps.avg_loss_0, key=1048576, offset=0, len=4, device=-1 rank=2
[2019-12-01 09:20:22.420569: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:20:22.420585: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.avg_loss, rank=2
[2019-12-01 09:20:22.420653: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:20:22.420697: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:20:22.420708: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:20:22.420748: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=0, rank=2, key=1048576
[2019-12-01 09:20:22.420772: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=1048576, myrank=3
[2019-12-01 09:20:22.464518: D byteps/common/global.cc:320] Declared tensor byteps.avg_loss, declared key (not PS key): 16 rank=1
[2019-12-01 09:20:22.464642: D byteps/common/operations.cc:229] byteps.avg_loss partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=1
[2019-12-01 09:20:22.464723: T byteps/common/operations.cc:243] Begin init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:20:22.464744: D byteps/common/operations.cc:249] byteps.avg_loss is already on cpu, len=4
[2019-12-01 09:20:22.465526: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 09:20:22.465557: T byteps/common/operations.cc:264] byteps.avg_loss: open shared memory size 4
[2019-12-01 09:20:22.465573: T byteps/common/operations.cc:295] Finish Init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:20:22.465600: T byteps/common/operations.cc:185] EnqueueTensor: byteps.avg_loss_0, key=1048576, offset=0, len=4, device=-1 rank=1
[2019-12-01 09:20:22.465616: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:20:22.465626: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.avg_loss, rank=1
[2019-12-01 09:20:22.465731: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:20:22.465765: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:20:22.465785: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:20:22.465825: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=0, rank=1, key=1048576
[2019-12-01 09:20:22.465863: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=1048576, myrank=3
[2019-12-01 09:20:22.465868: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.avg_loss_0 key: 1048576 rank: 3
[2019-12-01 09:20:22.465977: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=1048576, myrank=0
[2019-12-01 09:20:22.466006: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:20:22.466012: T byteps/common/core_loops.cc:165] byteps.avg_loss_0 calling NCCL REDUCE (rank=3) key=1048576, elements=1, device=-1
[2019-12-01 09:20:22.466023: T byteps/common/core_loops.cc:165] byteps.avg_loss_0 calling NCCL REDUCE (rank=0) key=1048576, elements=1, device=-1
[2019-12-01 09:20:22.466021: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=1048576, myrank=1
[2019-12-01 09:20:22.466043: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:20:22.466046: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 09:20:22.466035: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=1048576, myrank=2
[2019-12-01 09:20:22.466060: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:20:22.466060: T byteps/common/core_loops.cc:165] byteps.avg_loss_0 calling NCCL REDUCE (rank=1) key=1048576, elements=1, device=-1
[2019-12-01 09:20:22.466076: T byteps/common/core_loops.cc:165] byteps.avg_loss_0 calling NCCL REDUCE (rank=2) key=1048576, elements=1, device=-1
[2019-12-01 09:20:22.466092: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 09:20:22.466103: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 09:20:22.466138: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 09:20:22.466170: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:20:22.466224: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:20:22.466238: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:20:22.466294: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 09:20:22.466311: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:20:22.466326: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 09:20:22.466354: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:20:22.466361: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:20:22.466384: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:20:22.466367: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:20:22.466400: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:20:22.466408: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:20:22.466416: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:20:22.466421: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:20:22.466438: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 09:20:22.466451: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=2, rank=1, key=1048576
[2019-12-01 09:20:22.466451: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=2, rank=0, key=1048576
[2019-12-01 09:20:22.466481: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=1048576, myrank=3
[2019-12-01 09:20:22.466516: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=1048576, myrank=3
[2019-12-01 09:20:22.466557: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:20:22.466594: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:20:22.466613: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:20:22.466668: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=2, rank=2, key=1048576
[2019-12-01 09:20:22.466679: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=1048576, myrank=3
[2019-12-01 09:20:22.555950: F byteps/common/nccl_manager.cc:37] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20: 368 Aborted (core dumped) python $path/train_mnist_byteps.py $@
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/byteps/launcher/launch.py", line 41, in worker
subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh' returned non-zero exit status 134
Okay. It dies because of this now https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L157
It seems that your K80 GPUs have problems dealing with the tensors are are not inside the model. I suspect that these tensors are placed on CPU by PyTorch, and K80, as an older GPU, cannot properly map the CPU memory address into GPU memory address space. Consequently, NCCL complains that the given memory address is invalid. We'll see how to address this.
There is one simple way to verify this -- add test_loss.cuda()
right before https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L157
I think test loss is a float at that point because it was a float initially, since I'm getting
AttributeError: 'float' object has no attribute 'cuda'
I'll try making it a tensor initially and send it to cuda and see what happens.
@nowei You are right.. you can do it here https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L132
It's still dying there:
[2019-12-01 09:49:15.741209: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Gradient.conv1.bias_0 key: 0 rank: 0
[2019-12-01 09:49:15.741213: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Gradient.conv1.bias_0 key: 0 rank: 1
[2019-12-01 09:49:15.741224: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Gradient.conv1.bias_0 key: 0 rank: 2
[2019-12-01 09:49:15.741228: T byteps/common/core_loops.cc:165] byteps.Gradient.conv1.bias_0 calling NCCL BROADCAST (rank=0) key=0, elements=10, device=0
[2019-12-01 09:49:15.741233: T byteps/common/core_loops.cc:165] byteps.Gradient.conv1.bias_0 calling NCCL BROADCAST (rank=1) key=0, elements=10, device=1
[2019-12-01 09:49:15.741259: T byteps/common/core_loops.cc:165] byteps.Gradient.conv1.bias_0 calling NCCL BROADCAST (rank=2) key=0, elements=10, device=2
[2019-12-01 09:49:15.741267: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 09:49:15.741270: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 09:49:15.741280: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 09:49:15.741283: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 09:49:15.741373: T byteps/common/core_loops.cc:76] Rank=0 finish processing tensor: byteps.Gradient.conv1.bias_0
[2019-12-01 09:49:15.741432: T byteps/common/core_loops.cc:76] Rank=2 finish processing tensor: byteps.Gradient.conv1.bias_0
[2019-12-01 09:49:15.741431: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 09:49:15.741457: T byteps/common/core_loops.cc:76] Rank=3 finish processing tensor: byteps.Gradient.conv1.bias_0
[2019-12-01 09:49:15.741473: T byteps/common/core_loops.cc:76] Rank=1 finish processing tensor: byteps.Gradient.conv1.bias_0
[2019-12-01 09:49:15.741492: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 09:49:15.741510: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=3
[2019-12-01 09:49:15.741532: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
[2019-12-01 09:49:16.267467: D byteps/common/global.cc:320] Declared tensor byteps.avg_loss, declared key (not PS key): 16 rank=0
[2019-12-01 09:49:16.267625: D byteps/common/operations.cc:229] byteps.avg_loss partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=0
[2019-12-01 09:49:16.267664: T byteps/common/operations.cc:243] Begin init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:49:16.267681: D byteps/common/operations.cc:249] byteps.avg_loss is already on cpu, len=4
[2019-12-01 09:49:16.268385: D byteps/common/global.cc:320] Declared tensor byteps.avg_loss, declared key (not PS key): 16 rank=2
[2019-12-01 09:49:16.268500: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 09:49:16.268508: D byteps/common/operations.cc:229] byteps.avg_loss partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=2
[2019-12-01 09:49:16.268532: T byteps/common/operations.cc:264] byteps.avg_loss: open shared memory size 4
[2019-12-01 09:49:16.268556: T byteps/common/operations.cc:295] Finish Init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:49:16.268559: T byteps/common/operations.cc:243] Begin init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:49:16.268572: D byteps/common/operations.cc:249] byteps.avg_loss is already on cpu, len=4
[2019-12-01 09:49:16.268577: T byteps/common/operations.cc:185] EnqueueTensor: byteps.avg_loss_0, key=1048576, offset=0, len=4, device=-1 rank=0
[2019-12-01 09:49:16.268590: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:49:16.268610: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.avg_loss, rank=0
[2019-12-01 09:49:16.268777: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:49:16.268820: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:49:16.268844: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:49:16.268891: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=0, rank=0, key=1048576
[2019-12-01 09:49:16.268942: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=0, key=1048576, myrank=3
[2019-12-01 09:49:16.269418: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 09:49:16.269448: T byteps/common/operations.cc:264] byteps.avg_loss: open shared memory size 4
[2019-12-01 09:49:16.269465: T byteps/common/operations.cc:295] Finish Init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:49:16.269500: T byteps/common/operations.cc:185] EnqueueTensor: byteps.avg_loss_0, key=1048576, offset=0, len=4, device=-1 rank=2
[2019-12-01 09:49:16.269522: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:49:16.269539: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.avg_loss, rank=2
[2019-12-01 09:49:16.269688: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:49:16.269718: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:49:16.269737: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:49:16.269770: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=0, rank=2, key=1048576
[2019-12-01 09:49:16.269785: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=0, key=1048576, myrank=3
[2019-12-01 09:49:16.311922: D byteps/common/global.cc:320] Declared tensor byteps.avg_loss, declared key (not PS key): 16 rank=3
[2019-12-01 09:49:16.312066: D byteps/common/operations.cc:229] byteps.avg_loss partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=3
[2019-12-01 09:49:16.312098: T byteps/common/operations.cc:243] Begin init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:49:16.312116: D byteps/common/operations.cc:249] byteps.avg_loss is already on cpu, len=4
[2019-12-01 09:49:16.312944: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 09:49:16.312973: T byteps/common/operations.cc:264] byteps.avg_loss: open shared memory size 4
[2019-12-01 09:49:16.312994: T byteps/common/operations.cc:295] Finish Init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:49:16.313015: T byteps/common/operations.cc:185] EnqueueTensor: byteps.avg_loss_0, key=1048576, offset=0, len=4, device=-1 rank=3
[2019-12-01 09:49:16.313032: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 3
[2019-12-01 09:49:16.313059: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.avg_loss, rank=3
[2019-12-01 09:49:16.319924: D byteps/common/global.cc:320] Declared tensor byteps.avg_loss, declared key (not PS key): 16 rank=1
[2019-12-01 09:49:16.320052: D byteps/common/operations.cc:229] byteps.avg_loss partitioned to 1 part(s), total_len=4, key_range=[1048576, 1048576] rank=1
[2019-12-01 09:49:16.320084: T byteps/common/operations.cc:243] Begin init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:49:16.320103: D byteps/common/operations.cc:249] byteps.avg_loss is already on cpu, len=4
[2019-12-01 09:49:16.320931: T byteps/common/shared_memory.cc:43] initialized share memory size 4
[2019-12-01 09:49:16.320959: T byteps/common/operations.cc:264] byteps.avg_loss: open shared memory size 4
[2019-12-01 09:49:16.320973: T byteps/common/operations.cc:295] Finish Init byteps.avg_loss, size=4, parts=1
[2019-12-01 09:49:16.320991: T byteps/common/operations.cc:185] EnqueueTensor: byteps.avg_loss_0, key=1048576, offset=0, len=4, device=-1 rank=1
[2019-12-01 09:49:16.321007: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:49:16.321022: T byteps/common/operations.cc:199] EnqueueTensor finished: byteps.avg_loss, rank=1
[2019-12-01 09:49:16.321140: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_REDUCE getTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:49:16.321173: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:49:16.321191: T byteps/common/scheduled_queue.cc:94] Queue REDUCE addTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:49:16.321233: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=0, rank=1, key=1048576
[2019-12-01 09:49:16.321271: T byteps/common/scheduled_queue.cc:129] Queue REDUCE getTask: byteps.avg_loss_0 key: 1048576 rank: 3
[2019-12-01 09:49:16.321271: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=0, key=1048576, myrank=3
[2019-12-01 09:49:16.321373: T byteps/common/core_loops.cc:165] byteps.avg_loss_0 calling NCCL REDUCE (rank=3) key=1048576, elements=1, device=-1
[2019-12-01 09:49:16.321375: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=1048576, myrank=0
[2019-12-01 09:49:16.321383: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=1048576, myrank=1
[2019-12-01 09:49:16.321403: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:49:16.321397: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=4, key=1048576, myrank=2
[2019-12-01 09:49:16.321418: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:49:16.321426: T byteps/common/core_loops.cc:165] byteps.avg_loss_0 calling NCCL REDUCE (rank=0) key=1048576, elements=1, device=-1
[2019-12-01 09:49:16.321428: T byteps/common/scheduled_queue.cc:153] Queue REDUCE getTask(key): byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:49:16.321448: T byteps/common/core_loops.cc:165] byteps.avg_loss_0 calling NCCL REDUCE (rank=1) key=1048576, elements=1, device=-1
[2019-12-01 09:49:16.321457: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=0
[2019-12-01 09:49:16.321475: T byteps/common/core_loops.cc:165] byteps.avg_loss_0 calling NCCL REDUCE (rank=2) key=1048576, elements=1, device=-1
[2019-12-01 09:49:16.321494: T byteps/common/core_loops.cc:247] NCCL Group size=1 rank=3
[2019-12-01 09:49:16.321513: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=1
[2019-12-01 09:49:16.321515: T byteps/common/communicator.cc:253] non-root socket recved: src=3, signal=6, key=0, myrank=2
[2019-12-01 09:49:16.321694: T byteps/common/core_loops.cc:67] Rank=1 finishes REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:49:16.321694: T byteps/common/core_loops.cc:67] Rank=2 finishes REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:49:16.321725: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:49:16.321725: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:49:16.321713: T byteps/common/core_loops.cc:67] Rank=0 finishes REDUCE, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:49:16.321750: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=2
[2019-12-01 09:49:16.321751: T byteps/common/scheduled_queue.cc:94] Queue COORDINATE_BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:49:16.321750: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=1
[2019-12-01 09:49:16.321767: T byteps/common/core_loops.cc:308] Finished NCCL Group size=1 rank=0
[2019-12-01 09:49:16.321799: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:49:16.321818: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:49:16.321836: T byteps/common/core_loops.cc:67] Rank=0 finishes COORDINATE_BROADCAST, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:49:16.321842: T byteps/common/core_loops.cc:67] Rank=2 finishes COORDINATE_BROADCAST, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:49:16.321863: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 2
[2019-12-01 09:49:16.321867: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 0
[2019-12-01 09:49:16.321886: T byteps/common/scheduled_queue.cc:129] Queue COORDINATE_BROADCAST getTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:49:16.321906: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=2, rank=2, key=1048576
[2019-12-01 09:49:16.321912: T byteps/common/core_loops.cc:67] Rank=1 finishes COORDINATE_BROADCAST, tensor: byteps.avg_loss_0, key=1048576; Passing to the next queue.
[2019-12-01 09:49:16.321908: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=2, rank=0, key=1048576
[2019-12-01 09:49:16.321922: T byteps/common/communicator.cc:199] root socket recved: src=2, signal=2, key=1048576, myrank=3
[2019-12-01 09:49:16.321937: T byteps/common/scheduled_queue.cc:94] Queue BROADCAST addTask: byteps.avg_loss_0 key: 1048576 rank: 1
[2019-12-01 09:49:16.321945: T byteps/common/communicator.cc:199] root socket recved: src=0, signal=2, key=1048576, myrank=3
[2019-12-01 09:49:16.322000: T byteps/common/core_loops.cc:126] byteps.avg_loss_0 send coordinate info: Signal=2, rank=1, key=1048576
[2019-12-01 09:49:16.322019: T byteps/common/communicator.cc:199] root socket recved: src=1, signal=2, key=1048576, myrank=3
[2019-12-01 09:49:16.408270: F byteps/common/nccl_manager.cc:37] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20: 2332 Aborted (core dumped) python $path/train_mnist_byteps.py $@
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/byteps/launcher/launch.py", line 41, in worker
subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh' returned non-zero exit status 134
I also tried sending the tensor to cuda and it still gets the same error. Wait, I had to save the tensor back to itself and I think it's running more, I'm turning off the debug info to see if it actually makes it past one epoch.
Oh, hey, it's training! Thanks!
@nowei Thanks. Apparently, Pascal architecture has some improvement over Kelper on the CPU/GPU memory management (https://devblogs.nvidia.com/unified-memory-cuda-beginners/) so this particular problem does not appear in our own environment. We will improve the code robustness.
Describe the bug Running on multiple GPUs fails on docker image built on AWS EC2 Spot instance. Specifically, I was running an instance of p2.8xlarge.
To Reproduce Steps to reproduce the behavior: I tried to follow https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md as closely as possible. I'm pretty sure nvidia changed
nvidia-docker
todocker --gpus
and I ended up usingnvidia-driver-430
. So I had to use something likedocker run -it --net=host --gpus all --shm-size=32768m bytepsimage/worker_pytorch bash
instead ofnvidia-docker
.Anyways, I go through the rest of the step-by-step guide and it works out okay and I can run it with
EVAL_TYPE=benchmark
, but when I switch toEVAL_TYPE=mnist
, it breaks. When I run it withexport NVIDIA_VISIBLE_DEVICES=0
, it seems to work out fine except for the odd accuracies, which can be fixed by commenting outinside of
example/pytorch/train_mnist_byteps.py
.Expected behavior I guess I expected it to run like it did in the one gpu case, but I may have been too optimistic
Screenshots Picture of it working on the benchmark Picture of it dying on MNIST Picture of it running with one GPU Really janky accuracy
Environment (please complete the following information): I don't think the AWS stuff is relevant, but I thought I should include it just in case. OS:
GCC version:
CUDA and NCCL version:
Framework (TF, PyTorch, MXNet): PyTorch
Additional context I tried something similar with the docker image for tensorflow and it seemed to work with multiple gpus, but it never printed any accuracies because I don't think the code had any accuracy statements in it. Also, for that image you either had to install everything with python3 or change the shell file to execute using python2. Any help would be appreciated!