Open 913871734 opened 6 days ago
The specific message you're seeing (Software caused connection abort) is due to a bug in NCCL that's already been fixed for the next release. However, that bug is most likely a secondary effect here and not the true source of your problems (it's a bug in the error handling code, but what caused the error in the first place?). Have you tried running NCCL with the NCCL_DEBUG=INFO environment variable set? If the problem is still reproducible then, we'd like to see the debug output it produces. If you can't reproduce it with INFO (which could happen if it's a race condition), try the significantly less verbose NCCL_DEBUG=WARN.
In principle, modifying and recompiling NCCL is easy, and indeed it should be enough to replace the single libnccl.so.2 file with the new version. Given the difficulties you described, I suggest that you make sure that the new version is included in both running pods, check with ldd that the dynamic loader is in fact loading the library from the location you expect (you may want to double-check at run time with something like grep libnccl /proc/pid/maps), and finally make sure that libnccl.so.2 (which is typically a soft link) points to your modified variant.
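If grep is awkward to run inside the pod, the same run-time check can be sketched in Python. This is only an illustration: the helper names mapped_libraries and report_mapped_libraries are made up for this example, and it assumes a Linux /proc filesystem and that you run it inside the same pod as the training process.

```python
def mapped_libraries(maps_text, needle="libnccl"):
    """Return sorted, de-duplicated file paths from a /proc/<pid>/maps dump
    whose path contains `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def report_mapped_libraries(pid, needle="libnccl"):
    """Print every mapped file of process `pid` whose path contains `needle`."""
    with open(f"/proc/{pid}/maps") as maps_file:
        for path in mapped_libraries(maps_file.read(), needle):
            print(path)
```

Calling report_mapped_libraries with the PID of the running job should print the file that is actually mapped, which you can then compare against the location of your rebuilt library.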
Are you saying that setting NCCL_DEBUG=INFO does not generate tons of debug output for you, at least on startup? I don't know the details of your set-up, but it should, so I'm guessing that the output must be going somewhere in your case, maybe simply not where you expect? You could try passing something like NCCL_DEBUG_FILE=$HOME/nccl_debug.%h.%p, which should ensure that the output from each NCCL process goes to a separate file in $HOME.
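If the job is started from a Python launcher, one way to make sure these variables actually reach the NCCL processes is to build the environment explicitly. The following is a minimal sketch; nccl_debug_env and run_with_nccl_debug are hypothetical helper names, not part of NCCL, and the sketch assumes the launched command passes its environment through to every rank.

```python
import os
import subprocess


def nccl_debug_env(log_dir, level="INFO", base_env=None):
    """Return a copy of the environment with NCCL debug logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    # NCCL itself expands %h (hostname) and %p (process ID), so every
    # process writes to its own file.
    env["NCCL_DEBUG_FILE"] = os.path.join(log_dir, "nccl_debug.%h.%p")
    return env


def run_with_nccl_debug(cmd, log_dir):
    """Run `cmd` (a list of arguments) with the debug environment applied."""
    return subprocess.call(cmd, env=nccl_debug_env(log_dir))
```

For example, run_with_nccl_debug(["./all_reduce_perf"], os.path.expanduser("~")) would run the test binary with one log file per process in the home directory; the binary path here is a placeholder.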
Unfortunately, the fix for the "software caused connection abort" bug is not a one-liner, and extracting it from the ~500 lines of changes to misc/socket.cc that we've accumulated for the next release is nontrivial. But the problem was basically that in case of ECONNREFUSED and ETIMEDOUT errors from the first call to connect in socketStartConnect, NCCL should've been closing the socket and opening a new one before retrying. Because it wasn't, the next call to connect would fail with ECONNABORTED, at least if the socket was nonblocking. The same problem was present in socketPollConnect.
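To illustrate the idea (this is not the NCCL code, just a sketch of the pattern in Python with a made-up helper name, connect_with_retry): after a refused or timed-out connect, the socket is closed and a new one is created for the next attempt instead of calling connect again on the old one.

```python
import errno
import socket
import time

# Errors after which a retry on a fresh socket is reasonable.
RETRYABLE = {errno.ECONNREFUSED, errno.ETIMEDOUT}


def connect_with_retry(host, port, retries=5, delay=0.2, timeout=2.0):
    """Connect to (host, port), creating a brand-new socket for every attempt."""
    last_error = None
    for _ in range(retries):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            return sock
        except OSError as exc:
            # Never reuse a socket whose connect failed: close it first.
            sock.close()
            retryable = isinstance(exc, socket.timeout) or exc.errno in RETRYABLE
            if not retryable:
                raise
            last_error = exc
            time.sleep(delay)
    raise ConnectionError(f"gave up after {retries} attempts") from last_error
```

The retry count, delay, and timeout are arbitrary values chosen for the example.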
The question is, though: why were you getting ECONNREFUSED or ETIMEDOUT in the first place?
I've run into a tricky problem. When I run a task, it sometimes reports the following error:
socketStartConnect: Connect to 10.45.234.83<47527> failed : Software caused connection abort
The problem doesn't occur every time, but it is very likely: roughly nine out of ten runs fail. I have checked basic network connectivity and it is fine: ping works, and I can use nc to connect between the two pods.
What's even stranger is that I modified misc/socket.cc and recompiled libnccl.so to overwrite the previous one. However, the error messages reported by the task did not match my newly compiled code, as if the task were not actually using the libnccl.so I had just built. Yet when I ran all_reduce_perf, some of the log messages I had added were printed. Please help, I'm out of ideas.