awslabs / dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
https://dglke.dgl.ai/doc/
Apache License 2.0

"Resource temporarily unavailable" when calling `recv` in distributed training #229

Open ryantd opened 3 years ago

ryantd commented 3 years ago

I followed https://aws-dglke.readthedocs.io/en/latest/dist_train.html and got the following `recv` error:

/dgl_workspace/dgl/src/graph/network/tcp_socket.cc:180: recv error: Resource temporarily unavailable
terminate called after throwing an instance of 'dmlc::Error'
  what():  [04:04:56] /dgl_workspace/dgl/src/graph/network/socket_communicator.cc:282: Check failed: tmp != -1 (-1 vs. -1) : 
Stack trace:
  [bt] (0) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f55a8961b2f]
  [bt] (1) /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f55af39efa3]
  [bt] (2) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f55af1424cf]

The only change I made was to revise tcp_socket.cc so that it prints the error detail.
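For context, "Resource temporarily unavailable" is the message Linux associates with the errno `EAGAIN` (also called `EWOULDBLOCK`). `recv` typically reports it when a socket is non-blocking and no data has arrived yet, or when a receive timeout set on the socket expires. The thread does not establish which of these applies inside DGL, so the following is only a minimal sketch of the errno itself, using a local socket pair rather than DGL's code. The function name is hypothetical.

```python
import errno
import os
import socket


def recv_errno_on_empty_nonblocking_socket():
    """Return the errno raised by recv() on a non-blocking socket with no data."""
    a, b = socket.socketpair()
    try:
        a.setblocking(False)
        try:
            a.recv(8)  # nothing has been sent, so there is nothing to read
        except BlockingIOError as exc:
            return exc.errno
        return None
    finally:
        a.close()
        b.close()


code = recv_errno_on_empty_nonblocking_socket()
# Print whether the errno is EAGAIN/EWOULDBLOCK and the platform's message for it.
print(code in (errno.EAGAIN, errno.EWOULDBLOCK), os.strerror(code))
```

If this is what is happening here, the error would mean "no data yet" rather than a broken connection, which fits a failure during the connection phase.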

I run 2 training containers as workers; each has 30 cores and 150Gi of memory.

My analysis so far:

The recv error may be raised when `_sender_connect` is triggered:

https://github.com/dmlc/dgl/blob/5626058a5a658deb8338c3da5f27252c61507223/python/dgl/contrib/dis_kvstore.py#L666

classicsong commented 3 years ago

Did you see any OOM?

ryantd commented 3 years ago

> Did you see any OOM?

@classicsong No. I also noticed:

  1. The error was raised when the client tried to connect to the servers, before the actual training started.
  2. The containers' monitoring showed no sign of OOM.

classicsong commented 3 years ago

Did you check if the server is actually running?

ryantd commented 3 years ago

> Did you check if the server is actually running?

Yes
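One way to check this from the client container is to attempt a plain TCP connection to the server's address before starting training. The sketch below assumes nothing about DGL-KE itself: `is_listening` is a hypothetical helper, and the demo uses a throwaway listener on the loopback interface in place of a real server address. A successful connection only shows that something accepts connections on that port, not that the server is healthy.

```python
import socket


def is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Demo against a temporary local listener (stand-in for a real server address).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
host, port = listener.getsockname()

up = is_listening(host, port)    # probe while the listener is open
listener.close()
down = is_listening(host, port)  # probe again after it is closed
print(up, down)
```

In a real setup, the host and port would come from the machine list used to launch distributed training.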

ryantd commented 3 years ago

Update:

I traced into DGL's `void SocketReceiver::RecvLoop(TCPSocket* socket, MessageQueue* queue) {..}` and found that the read into `reinterpret_cast<char*>(&data_size)` received nothing, so `socket->Receive(...)` returned -1.
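A receive loop that reads a size header first can treat "no data yet" differently from a real failure. The sketch below is not DGL's implementation. It is a Python illustration of the general pattern, under these assumptions: the header is an 8-byte little-endian signed integer, the socket is non-blocking, and `recv_exact` is a hypothetical helper. On `EAGAIN` it waits for the socket to become readable instead of giving up.

```python
import select
import socket
import struct


def recv_exact(sock, nbytes, timeout=5.0):
    """Read exactly nbytes, waiting up to `timeout` seconds whenever no data is ready."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        try:
            chunk = sock.recv(remaining)
        except BlockingIOError:
            # EAGAIN/EWOULDBLOCK: nothing to read yet, so wait for readability.
            readable, _, _ = select.select([sock], [], [], timeout)
            if not readable:
                raise TimeoutError(f"no data within {timeout} seconds")
            continue
        if not chunk:
            # recv() returning b"" means the peer closed the connection.
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Demo with a local socket pair: the sender writes a size header, then the payload.
receiver, sender = socket.socketpair()
receiver.setblocking(False)
sender.sendall(struct.pack("<q", 3) + b"abc")

(size,) = struct.unpack("<q", recv_exact(receiver, 8))
payload = recv_exact(receiver, size)
print(size, payload)

receiver.close()
sender.close()
```

The design choice is to separate three outcomes that a bare `-1` return hides: data not ready yet, peer closed, and a genuine error.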