NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL WARN socketProgress: Connection closed by remote peer #1486

Open ganyu1992 opened 4 days ago

ganyu1992 commented 4 days ago

Hi, I got "socketProgress: Connection closed by remote peer" when executing ncclAllToAll via ncclSend and ncclRecv. I noticed that when an NCCL_SOCKET_RECV operation receives zero bytes, the socket is marked as closed:

if (op == NCCL_SOCKET_RECV && bytes == 0) {
  *closed = 1;
  return ncclSuccess;
}

code: https://github.com/NVIDIA/nccl/blob/master/src/misc/socket.cc#L24

Why is the socket closed here when recv returns zero bytes? I can't guarantee that every GPU has data to send, so zero-length data is possible. Could I modify the code here?

  *closed = 0;

kiskra-nvidia commented 3 days ago

I'm trying to understand the scenario that doesn't work for you. Are you saying that you invoke ncclRecv with a count of 0? Is there a corresponding ncclSend as well? A reproducer would help.

kiskra-nvidia commented 3 days ago

In general though, the code you quoted is correct as-is. recv for a stream (TCP) socket normally returns 0 only if the remote peer has closed the socket (a 0 return from recv is an end-of-file indicator). So I don't expect that the modification you propose would improve anything for you. We need to understand what the underlying issue is, and for that we need a reproducer and a complete output with NCCL_DEBUG=INFO. Was this a run at large scale?
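
To illustrate the end-of-file point, here is a minimal sketch that is independent of NCCL. It assumes Linux and uses an AF_UNIX stream socket pair as a stand-in for a TCP connection; `recv_after` and `peer_action` are hypothetical names made up for this example. It checks what a non-blocking recv reports after the peer sends zero bytes, sends one byte, or closes its end.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of NCCL. */
enum peer_action { PEER_SENDS_ZERO_BYTES, PEER_SENDS_ONE_BYTE, PEER_CLOSES };

/* Returns what a non-blocking recv() sees after the peer performs `action`:
 *   >0  number of bytes read
 *    0  end-of-file (peer closed the connection)
 *   -1  no data available (EAGAIN / EWOULDBLOCK)
 *   -2  any other error
 * Assumption: an AF_UNIX stream socket pair stands in for a TCP connection. */
int recv_after(enum peer_action action) {
  int fds[2];
  char buf[8];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -2;

  switch (action) {
    case PEER_SENDS_ZERO_BYTES: send(fds[1], buf, 0, 0); break;  /* queues nothing */
    case PEER_SENDS_ONE_BYTE:   send(fds[1], "x", 1, 0); break;
    case PEER_CLOSES:           close(fds[1]); fds[1] = -1; break;
  }

  /* MSG_DONTWAIT keeps the call from blocking when nothing is queued. */
  ssize_t n = recv(fds[0], buf, sizeof(buf), MSG_DONTWAIT);
  int result;
  if (n >= 0) result = (int)n;
  else if (errno == EAGAIN || errno == EWOULDBLOCK) result = -1;
  else result = -2;

  close(fds[0]);
  if (fds[1] >= 0) close(fds[1]);
  return result;
}
```

Under these assumptions, only `PEER_CLOSES` should make recv return 0. A zero-length send on a stream socket puts nothing on the wire, so the receiver just sees "no data yet" rather than a zero-byte read. That is why a 0 return is treated as a closed connection and not as an empty message.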