"Resource temporarily unavailable" when distributed training on full Freebase

I am running the following command on a cluster of 4 machines.

DGLBACKEND=pytorch dglke_dist_train --path ~/my_task --ip_config ~/my_task/ip_config8.txt \
--num_client_proc 40 --model TransE_l2 --dataset Freebase --data_path ~/my_task --hidden_dim 128 \
--gamma 10.0 --lr 0.1 --batch_size 1024 --neg_sample_size 256 --max_step 12800 --log_interval 256 \
--batch_size_eval 1024 --neg_sample_size_eval 1024 --test -adv --regularization_coef 1.00E-09 \
--no_save_emb --num_thread 1 >> fb-dglke.txt

I got the following errors:

/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
Traceback (most recent call last):
  File "/usr/local/bin/dglke_server", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/dglke/kvserver.py", line 232, in main
    start_server(args)
  File "/usr/local/lib/python3.6/site-packages/dglke/kvserver.py", line 227, in start_server
    my_server.start()
  File "/usr/local/lib64/python3.6/site-packages/dgl/contrib/dis_kvstore.py", line 509, in start
    _sender_connect(self._sender)
  File "/usr/local/lib64/python3.6/site-packages/dgl/network.py", line 98, in _sender_connect
    _CAPI_DGLSenderConnect(sender)
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: Resource temporarily unavailable

File "/usr/local/lib/python3.6/site-packages/dglke/models/pytorch/tensor_models.py", line 77, in decorated_function
    raise exception.__class__(trace)
dgl._ffi.base.DGLError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/dglke/models/pytorch/tensor_models.py", line 65, in _queue_result
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/dglke/train_pytorch.py", line 1492, in dist_train_test
    client = connect_to_kvstore(args, entity_pb, relation_pb, l2g)
  File "/usr/local/lib/python3.6/site-packages/dglke/train_pytorch.py", line 1111, in connect_to_kvstore
    my_client.connect()
  File "/usr/local/lib64/python3.6/site-packages/dgl/contrib/dis_kvstore.py", line 953, in connect
    _receiver_wait(self._receiver, client_ip, int(client_port), self._server_count)
  File "/usr/local/lib64/python3.6/site-packages/dgl/network.py", line 116, in _receiver_wait
    _CAPI_DGLReceiverWait(receiver, ip_addr, int(port), int(num_sender))
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: Resource temporarily unavailable

terminate called after throwing an instance of 'dmlc::Error'
  what():  [11:13:56] /opt/dgl/src/graph/network/socket_communicator.cc:144: Check failed: tmp != -1 (-1 vs. -1) :
Stack trace:
  [bt] (0) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(dgl::network::SocketSender::SendLoop(dgl::network::TCPSocket*, dgl::network::MessageQueue*)+0x7a6) [0x7f3002adce16]
  [bt] (1) /lib64/libstdc++.so.6(+0xb5070) [0x7f305d5ee070]
  [bt] (2) /lib64/libpthread.so.0(+0x7dd5) [0x7f306f1f6dd5]
  [bt] (3) /lib64/libc.so.6(clone+0x6d) [0x7f306e816ead]
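
"Resource temporarily unavailable" is the standard system message for errno EAGAIN. A minimal sketch for checking that mapping and printing the per-process limits on one machine follows. It assumes the failure is related to open-file or process limits, which I have not confirmed; the helper name describe_limits is only for illustration.

```python
import errno
import os
import resource

# The text in the traces above is the OS message for EAGAIN.
print(errno.EAGAIN, os.strerror(errno.EAGAIN))


def describe_limits():
    """Return (soft, hard) limits for open files and for processes/threads."""
    names = {
        "open files": resource.RLIMIT_NOFILE,
        "processes/threads": resource.RLIMIT_NPROC,
    }
    return {label: resource.getrlimit(res) for label, res in names.items()}


# Assumption: one of these limits may be involved; this only reports them.
for label, (soft, hard) in describe_limits().items():
    print(f"{label}: soft={soft} hard={hard}")
```

Running this on each of the 4 machines would show whether their limits differ.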

When I ran the following commands on each machine, I found that the number of server and client processes differed between machines:

ps -ef | grep dglke_server | grep -v grep | wc -l    # results on the 4 machines: 8, 7, 8, 8
ps -ef | grep dglke_client | grep -v grep | wc -l    # results on the 4 machines: 161, 106, 148, 110
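
The same count can be scripted so it is easy to repeat on every machine. This is a rough sketch, not part of DGL-KE: it reads /proc directly (Linux only), and the function names are hypothetical. It matches on the command line only, so its numbers may differ slightly from the ps pipeline above.

```python
import os


def read_command_lines(proc_root="/proc"):
    """Return the command line of every process visible under /proc."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc; nothing to report
    lines = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are process IDs
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited or is not readable
        # Arguments are NUL-separated; join them with spaces.
        lines.append(raw.replace(b"\0", b" ").decode(errors="replace").strip())
    return lines


def count_processes(command_lines, name):
    """Count command lines that mention `name`."""
    return sum(1 for line in command_lines if name in line)


if __name__ == "__main__":
    lines = read_command_lines()
    for name in ("dglke_server", "dglke_client"):
        print(name, count_processes(lines, name))
```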

Experimental configuration:

Python 3.6.8, DGL 0.4.3, DGL-KE 0.1.0. Each machine has 512 GB of memory.

When I change "--num_client_proc 40" to "--num_client_proc 8" or less, it works fine.
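
To put the two settings side by side, here is a back-of-the-envelope sketch of how many server-to-client connections each one implies. It rests on assumptions I have not verified: that --num_client_proc is counted per machine, that each machine runs 8 servers (as the ps output suggests for 3 of the 4 machines), and that every server connects to every client (the client trace waits for self._server_count senders). The function name is hypothetical.

```python
def connection_counts(num_machines, servers_per_machine, clients_per_machine):
    """Estimate connection counts, assuming every server connects to every client."""
    total_servers = num_machines * servers_per_machine
    total_clients = num_machines * clients_per_machine
    return {
        "total_servers": total_servers,
        "total_clients": total_clients,
        # Each server opens one connection per client (assumption).
        "connections_per_server": total_clients,
        # Each client accepts one connection per server (assumption).
        "connections_per_client": total_servers,
        "total_connections": total_servers * total_clients,
    }


for clients in (40, 8):
    print(clients, connection_counts(4, 8, clients))
```

Under these assumptions, 40 client processes per machine means 160 clients and 5120 connections in total, while 8 per machine means 32 clients and 1024 connections.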

awslabs/dgl-ke, issue #240