NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Nccl socketStartConnect: Connect to x.x.x.x<xxxx> failed : Software caused connection abort #1515

Open 913871734 opened 6 days ago

913871734 commented 6 days ago

I've run into a tricky problem. When I run a job, it sometimes reports the following error:

socketStartConnect: Connect to 10.45.234.83<47527> failed : Software caused connection abort

[Image: screenshot of the job output]

The problem doesn't occur every time, but it is frequent: roughly nine out of ten runs fail. I have checked basic network connectivity and it looks fine: I can connect between the two pods with nc, and ping works.

What's even stranger is that I modified misc/socket.cc, recompiled libnccl.so, and overwrote the previous libnccl.so with the new build. However, the error message reported by the task did not match what my newly compiled code should print, as if the task were not actually using the libnccl.so I had just built. Yet when I ran all_reduce_perf, some of the log messages I had added were printed. Please help; I'm out of ideas.

kiskra-nvidia commented 4 days ago

The specific message you're seeing (Software caused connection abort) is due to a bug in NCCL that's already been fixed for the next release. However, that bug is most likely a secondary effect here and not the true source of your problems (it's a bug in the error handling code -- but what caused the error in the first place?). Have you tried running NCCL with NCCL_DEBUG=INFO environment variable set? If it's still reproducible then, we'd like to see the debug output it produces. If you can't reproduce it with INFO (which could happen if it's a race condition), try with the significantly less verbose NCCL_DEBUG=WARN.

In principle modifying and recompiling NCCL is easy, and indeed it should be enough to replace the single libnccl.so.2 file with the new version. Given the difficulties you described, I suggest that you make sure that the new version is included in both running pods, check with ldd that the dynamic loader is in fact loading the library from the location you expect (you may want to double-check at run time with something like grep libnccl /proc/pid/maps), and finally make sure that libnccl.so.2 (which is typically a soft link) points to your modified variant.
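The run-time check can also be scripted. Below is a minimal Python sketch, assuming a Linux /proc filesystem; the function name `loaded_nccl_libraries` is made up for illustration and is not part of NCCL. It extracts the libnccl files that a process has actually mapped from the text of its maps file.

```python
def loaded_nccl_libraries(maps_text):
    """Return the sorted, de-duplicated paths of mapped files that contain 'libnccl'.

    maps_text is the content of /proc/<pid>/maps, where each line has the form:
        address perms offset dev inode [pathname]
    """
    paths = set()
    for line in maps_text.splitlines():
        # Split into at most 6 fields so a pathname with spaces stays intact.
        fields = line.split(None, 5)
        if len(fields) == 6 and "libnccl" in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)
```

To use it against a running task, read the file for that process, for example `loaded_nccl_libraries(open(f"/proc/{pid}/maps").read())`, where `pid` is the ID of the training process. If the path printed is not the file you rebuilt, the task is loading a different copy of the library.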

913871734 commented 3 days ago

  1. The screenshot above shows the complete output after setting NCCL_DEBUG=INFO. There are no other abnormal or useful messages, which is part of what puzzles me.
  2. I am very curious about the bug you mentioned (Software caused connection abort): what was it, and how was it fixed?
  3. I am quite sure that the libnccl.so.2 I compiled replaced the original shared library, because I had these pods run all_reduce_perf before running the task, and the output log confirmed that my libnccl.so.2 was being used. Judging by the error stack, the failure comes from a broadcast at the upper level. Could these operations load a different shared library?

kiskra-nvidia commented 3 days ago

Are you saying that setting NCCL_DEBUG=INFO does not generate tons of debug output for you, at least on startup? I don't know the details of your set-up, but it should, so I'm guessing that that output must be going somewhere in your case, maybe simply not where you expect? You could try passing something like NCCL_DEBUG_FILE=$HOME/nccl_debug.%h.%p, which should ensure that the output from each NCCL process goes to a separate file in $HOME.
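For a job started from a Python launcher, the two variables could be set as in the sketch below. This is only an illustration: the helper name `nccl_debug_env` and the way the job is launched are assumptions, and the variables must be in the environment of every NCCL process before NCCL initializes.

```python
import os


def nccl_debug_env(log_dir, level="INFO", base_env=None):
    """Return a copy of the environment with NCCL debug output sent to per-process files."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    # NCCL expands %h to the hostname and %p to the process ID,
    # so each process writes to its own file.
    env["NCCL_DEBUG_FILE"] = os.path.join(log_dir, "nccl_debug.%h.%p")
    return env


# Hypothetical usage (the command line is a placeholder):
#   import subprocess
#   subprocess.run(["./all_reduce_perf"], env=nccl_debug_env(os.path.expanduser("~")))
```

If no files appear in the chosen directory after a run, that suggests the variables are not reaching the NCCL processes at all, for example because the pod launcher does not forward the environment.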

Unfortunately the fix for the "software caused connection abort" bug is not a one-liner and extracting it from the ~500 lines of changes to misc/socket.cc that we've accumulated for the next release is nontrivial. But the problem was basically that in case of ECONNREFUSED and ETIMEDOUT errors from the first call to connect in socketStartConnect, NCCL should've been closing the socket and opening a new one before retrying. Because it wasn't, on the next call to connect, at least if the socket was nonblocking, connect would fail with ECONNABORTED. The same problem was present in socketPollConnect.
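The close-and-reopen pattern described above can be illustrated with a short Python sketch. This is not the NCCL code (which is C++ and uses nonblocking sockets); it uses blocking sockets with a timeout, and the names `is_retryable` and `connect_with_fresh_socket` are hypothetical. The point it shows is that a socket whose connect failed is never reused: every retry gets a new one.

```python
import socket
import time


def is_retryable(exc):
    """True for the two failures worth retrying: refused (ECONNREFUSED) or timed out (ETIMEDOUT)."""
    return isinstance(exc, (ConnectionRefusedError, TimeoutError, socket.timeout))


def connect_with_fresh_socket(host, port, attempts=5, delay=0.1, timeout=1.0):
    """Connect to (host, port), opening a new socket for every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            return sock
        except OSError as exc:
            # Always discard the failed socket; calling connect on it again
            # is what the faulty retry path did.
            sock.close()
            if not is_retryable(exc):
                raise
            last_error = exc
            time.sleep(delay)
    raise last_error
```

The caller owns the returned socket and is responsible for closing it. Errors other than the two retryable ones are re-raised immediately rather than masked by the retry loop.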

The question is though: why were you getting ECONNREFUSED or ETIMEDOUT in the first place?