Open ganyu1992 opened 4 days ago
I'm trying to understand the scenario that doesn't work for you. Are you saying that you invoke ncclRecv with a count of 0? Is there a corresponding ncclSend as well? A reproducer would help...
In general though, the code you quoted is correct as-is. recv for a stream (TCP) socket normally returns 0 only if the remote peer has closed the socket (a 0 return from recv is an end-of-file indicator). So I don't expect that the modification you propose would improve anything for you. We need to understand what the underlying issue is, and for that we need a reproducer and a complete output with NCCL_DEBUG=INFO. Was this a run at large scale?
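To illustrate the end-of-file point, here is a minimal sketch that is independent of NCCL. It assumes a POSIX system and uses an AF_UNIX socketpair as a stand-in for a TCP connection (both are stream sockets); the function names recv_after_peer_close and recv_after_empty_send are made up for this example. The first function shows recv returning 0 once the peer has closed its end. The second shows that a zero-length send on a stream socket puts nothing on the wire, so the receiver's recv simply returns the next real payload and does not return 0 because of the empty send.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: what does recv() report after the peer closes
 * without sending anything? On a stream socket this is 0 (end of file). */
ssize_t recv_after_peer_close(void) {
  int fds[2];
  char buf[16];
  int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
  assert(rc == 0);
  close(fds[1]);                                  /* peer closes its end */
  ssize_t n = recv(fds[0], buf, sizeof buf, 0);   /* EOF indicator */
  close(fds[0]);
  return n;
}

/* Hypothetical helper: the peer does a zero-length send, then sends 4 bytes.
 * The empty send transfers nothing, so the first recv() sees the 4 bytes. */
ssize_t recv_after_empty_send(void) {
  int fds[2];
  char buf[16];
  int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
  assert(rc == 0);
  ssize_t sent = send(fds[1], "", 0, 0);          /* zero-length send */
  assert(sent == 0);
  sent = send(fds[1], "data", 4, 0);              /* real payload */
  assert(sent == 4);
  ssize_t n = recv(fds[0], buf, sizeof buf, 0);
  close(fds[0]);
  close(fds[1]);
  return n;
}
```

Calling the two helpers from a small main and printing their return values is enough to see the difference; the sketch says nothing about how NCCL itself frames zero-count operations.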
Hi, I got "socketProgress: Connection closed by remote peer" when executing ncclAllToAll via ncclSend and ncclRecv. I noticed that if an NCCL_SOCKET_RECV operation receives zero bytes, the socket is closed:
code: https://github.com/NVIDIA/nccl/blob/master/src/misc/socket.cc#L24
Why is the socket closed here when recv returns zero bytes? I can't guarantee that all GPUs have data to send, so zero-length data is possible. Could I modify the code here?