Open ryantd opened 3 years ago
Did you see any OOM?
@classicsong No. And I noticed,
Did you check if the server is actually running?
Yes
Update:
I have already traced into DGL's `void SocketReceiver::RecvLoop(TCPSocket* socket, MessageQueue* queue) {..}`. I found that the read into `reinterpret_cast<char*>(&data_size)` got nothing, so `socket->Receive(...)` returned -1.
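To make a failure like this easier to diagnose, here is a minimal sketch of reading a size header while reporting why the read failed. It is not DGL's code: `RecvExact` and `RecvSizeHeader` are hypothetical helpers, and the 8-byte `int64_t` header is an assumption based on the `data_size` cast above. It uses plain POSIX `recv` and separates a peer that closed the connection (return value 0) from a real socket error (return value -1 with `errno` set).

```cpp
#include <sys/socket.h>
#include <sys/types.h>

#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not part of DGL): read exactly `size` bytes from `fd`.
// Returns `size` on success, 0 if the peer closed the connection before all
// bytes arrived, and -1 on a socket error (errno is left set by recv).
int64_t RecvExact(int fd, char* buffer, int64_t size) {
  assert(buffer != nullptr && size > 0);
  int64_t received = 0;
  while (received < size) {
    ssize_t n = recv(fd, buffer + received,
                     static_cast<size_t>(size - received), 0);
    if (n > 0) {
      received += n;   // partial reads are normal on stream sockets
    } else if (n == 0) {
      return 0;        // orderly shutdown by the peer
    } else if (errno == EINTR) {
      continue;        // interrupted by a signal, so retry
    } else {
      return -1;       // real error, caller can inspect errno
    }
  }
  return received;
}

// Hypothetical helper: read the assumed 8-byte size header and print the
// reason if the read fails. Returns true only when the header was read fully.
bool RecvSizeHeader(int fd, int64_t* data_size) {
  int64_t n = RecvExact(fd, reinterpret_cast<char*>(data_size),
                        static_cast<int64_t>(sizeof(*data_size)));
  if (n == -1) {
    std::fprintf(stderr, "recv failed: errno=%d (%s)\n", errno,
                 std::strerror(errno));
    return false;
  }
  if (n == 0) {
    std::fprintf(stderr, "recv: peer closed the connection\n");
    return false;
  }
  return true;
}
```

With logging like this, an `ECONNRESET` or `ETIMEDOUT` message would point to the network or the remote process, while a "peer closed" message would suggest the sender exited or shut the socket down first.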
I followed https://aws-dglke.readthedocs.io/en/latest/dist_train.html and got a `recv` error.
The only change I made was to revise `tcp_socket.cc` to show the error detail.
My training workers (2 training containers) each have 30 cores and 150Gi of memory.
My analysis so far:
The `recv` error may be raised when `_sender_connect` is triggered: https://github.com/dmlc/dgl/blob/5626058a5a658deb8338c3da5f27252c61507223/python/dgl/contrib/dis_kvstore.py#L666