/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
Traceback (most recent call last):
File "/usr/local/bin/dglke_server", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/dglke/kvserver.py", line 232, in main
start_server(args)
File "/usr/local/lib/python3.6/site-packages/dglke/kvserver.py", line 227, in start_server
my_server.start()
File "/usr/local/lib64/python3.6/site-packages/dgl/contrib/dis_kvstore.py", line 509, in start
_sender_connect(self._sender)
File "/usr/local/lib64/python3.6/site-packages/dgl/network.py", line 98, in _sender_connect
_CAPI_DGLSenderConnect(sender)
File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/base.py", line 62, in check_call
raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: Resource temporarily unavailable
File "/usr/local/lib/python3.6/site-packages/dglke/models/pytorch/tensor_models.py", line 77, in decorated_function
raise exception.__class__(trace)
dgl._ffi.base.DGLError: Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/dglke/models/pytorch/tensor_models.py", line 65, in _queue_result
res = func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/dglke/train_pytorch.py", line 1492, in dist_train_test
client = connect_to_kvstore(args, entity_pb, relation_pb, l2g)
File "/usr/local/lib/python3.6/site-packages/dglke/train_pytorch.py", line 1111, in connect_to_kvstore
my_client.connect()
File "/usr/local/lib64/python3.6/site-packages/dgl/contrib/dis_kvstore.py", line 953, in connect
_receiver_wait(self._receiver, client_ip, int(client_port), self._server_count)
File "/usr/local/lib64/python3.6/site-packages/dgl/network.py", line 116, in _receiver_wait
_CAPI_DGLReceiverWait(receiver, ip_addr, int(port), int(num_sender))
File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/base.py", line 62, in check_call
raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: Resource temporarily unavailable
terminate called after throwing an instance of 'dmlc::Error'
what(): [11:13:56] /opt/dgl/src/graph/network/socket_communicator.cc:144: Check failed: tmp != -1 (-1 vs. -1) :
Stack trace:
[bt] (0) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(dgl::network::SocketSender::SendLoop(dgl::network::TCPSocket*, dgl::network::MessageQueue*)+0x7a6) [0x7f3002adce16]
[bt] (1) /lib64/libstdc++.so.6(+0xb5070) [0x7f305d5ee070]
[bt] (2) /lib64/libpthread.so.0(+0x7dd5) [0x7f306f1f6dd5]
[bt] (3) /lib64/libc.so.6(clone+0x6d) [0x7f306e816ead]
When I tried the following command, I found that the number of servers and clients were different on each machine:
I am running the following command on a cluster of 4 machines.
I got the following errors:
Experimental configuration:
When I change "--num_client_proc 40" to "--num_client_proc 8" or fewer, it works fine.
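For anyone trying to reproduce this: "Resource temporarily unavailable" is the text Linux normally reports for errno EAGAIN. One possible explanation, which is only an assumption and is not confirmed by the logs above, is that 40 client processes per machine run into a per-user limit on open sockets or on processes/threads. A minimal sketch for printing those limits on each machine with Python's standard library (the helper name describe_limits is made up for illustration and is not part of DGL or DGL-KE):

```python
import errno
import os
import resource


def describe_limits():
    """Return (soft, hard) limits that commonly cap sockets and processes.

    Hypothetical helper for diagnosis only; not part of DGL or DGL-KE.
    """
    names = {
        "open_files": resource.RLIMIT_NOFILE,            # sockets count as open files
        "processes_or_threads": resource.RLIMIT_NPROC,   # per-user process/thread cap
    }
    return {label: resource.getrlimit(res) for label, res in names.items()}


if __name__ == "__main__":
    # Show the OS message associated with EAGAIN for comparison with the log.
    print("EAGAIN means:", os.strerror(errno.EAGAIN))
    for label, (soft, hard) in describe_limits().items():
        # A value of -1 (resource.RLIM_INFINITY) means "unlimited".
        print("{}: soft={} hard={}".format(label, soft, hard))
```

Running this on all 4 machines under the same user that launches the servers and clients would show whether the limits differ between machines. It does not prove that a limit is the cause.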