wangsiyu opened this issue 5 years ago
Hi @wangsiyu can you please try with NCCL 2.4.7 and see if it fixes the issue? Thanks!
I have the same issue. @wangsiyu did you solve it? Thanks in advance.
I worked around this problem by setting NCCL_LL_THRESHOLD=0. It seems to be a bug in 2.4.2. I did not upgrade the NCCL version, but I believe this bug was fixed in 2.4.6 according to its release notes. Thanks very much!
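For reference, a minimal sketch of that workaround (the mpirun launch line is illustrative; adapt it to whatever launcher you use):

```shell
# Disable NCCL's LL (low-latency) protocol by forcing its size
# threshold to 0, so all messages use the Simple protocol instead.
# The variable must be set identically on every node before launch.
export NCCL_LL_THRESHOLD=0

# Illustrative launch: forward the variable to all ranks, e.g.
#   mpirun -np 8 -x NCCL_LL_THRESHOLD python train.py
echo "NCCL_LL_THRESHOLD=${NCCL_LL_THRESHOLD}"
```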
I can reproduce it on 2.4.7
Can you explain in more detail what you could reproduce with 2.4.7? Do you have a hang that can be solved by setting NCCL_LL_THRESHOLD=0? Is it a hang showing one thread blocked in recv()?
Many things could cause hangs, so it might be good to open another bug to make sure it is the same issue. Also note that NCCL 2.5 is the latest stable version you might want to try, and a preview of NCCL 2.6 is also available on the v2.6 branch: https://github.com/nvidia/nccl/tree/v2.6. Feel free to give it a try. Thanks!
I used gdb to print the stack trace. It's blocked at
#0 syscall ... x86_64/syscall.S:38
#1 nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec)
from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
Sorry that I cannot paste the full trace. By the way, NCCL_LL_THRESHOLD=0 fixed the hang in the three-node setting; with 10 nodes it still fails occasionally.
Can you describe the problem you are facing in a bit more detail? Do you have a hang? Does it reproduce every time, immediately or only after some time?
Also, pasting the NCCL debug output would help (at least set NCCL_DEBUG=WARN to double-check the NCCL version and catch any warnings).
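As a concrete example, the debug setting mentioned above can be enabled like this (a sketch; how you propagate it to your job depends on your setup):

```shell
# WARN prints only warnings. At INFO level NCCL additionally prints
# its version at init time, which is useful for confirming that every
# node loads the same build.
export NCCL_DEBUG=WARN
echo "NCCL_DEBUG=${NCCL_DEBUG}"
```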
It always hangs during training, but at a different step each time. I set NCCL_DEBUG=INFO and get the following messages before the hang.
[0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/SOCKET : message truncated : receiving 1048576 bytes instead of 32768
[0] NCCL INFO bazel-out/k8-py2-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:34 -> 3
[0] NCCL INFO external/nccl_archive/src/transport/net.cc:533 -> 3
[0] NCCL INFO external/nccl_archive/src/transport.cc:163 -> 3 [Proxy Thread]
@372046933 Could you be using different versions of NCCL on different nodes, or have you compiled NCCL from files of different versions? I just noticed your error messages show different NCCL home paths. As mentioned by @sjeaugey, it would be good to confirm the NCCL version on each node.
Another possibility is having set different values for some environment variables on different nodes. For example, one node could have NCCL_LL_THRESHOLD=0 set while the other nodes do not. The message-truncation warning you are getting suggests this is likely the cause: 1048576 bytes (1 MB) is the message size used by the Simple algorithm (when NCCL_LL_THRESHOLD is forced to 0), whereas 32768 bytes (32 KB) could be the message size used by the LL algorithm (when NCCL_LL_THRESHOLD is not set).
Well, I run the script in Docker; all containers use the same image and the same environment set by Kubernetes. I have checked that the Docker image checksum is the same on all nodes. By the way, TensorFlow 2.0 was compiled together with NCCL 2.4.7.
Thanks for the check. The next thing I would check is whether the ranks make the collective call with different sizes.
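One way to check this without modifying the training script is NCCL's own per-collective logging. This is a sketch; the exact log fields can vary between NCCL versions, and the grep pattern below is illustrative:

```shell
# Make NCCL print one INFO line per collective call, including the
# element count, so the sequence of sizes can be compared across ranks.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL

# Run training on each node, capturing the log, e.g.:
#   python train.py 2> rank-${RANK}.log
# Then extract the per-call counts and diff them between two ranks
# (the 'count' field name is illustrative; check your log format):
#   grep -o 'count [0-9]*' rank-0.log > counts-0.txt
#   grep -o 'count [0-9]*' rank-1.log > counts-1.txt
#   diff counts-0.txt counts-1.txt
echo "NCCL_DEBUG=${NCCL_DEBUG} NCCL_DEBUG_SUBSYS=${NCCL_DEBUG_SUBSYS}"
```

Any divergence in the diff points at the rank (and the step) where the collective was called with a mismatched size.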
Hi, I am an engineer at Alibaba Group and have encountered some problems using NCCL in recent days. When I run NCCL 2.4.2 across multiple nodes, the program hangs randomly.
nvidia-smi
shows the GPU utilization is always 100%, and the CPU utilization is not 0. When I attach
gdb -p
to the hung process on each node, I find some threads stopped at socket recv(). At the same time, NCCL_DEBUG=WARN
does not show more useful information; it only shows that IB detection failed, but the hang happened after training for a long time. Is this a bug in NCCL, or does some inappropriate usage lead to this hang?