NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL hang on socket recv() #238

Open wangsiyu opened 5 years ago

wangsiyu commented 5 years ago

Hi, I am an engineer in Alibaba Group and encounter some problems in using nccl in recent days. When I run nccl 2.4.2 across multiple nodes, the program hangs randomly. nvidia-smi shows the GPU utility is always 100%. image And CPU utility is not 0 image

I used gdb -p to attach to the hung process on each node and found some threads stopped in socket recv() (screenshot attached). At the same time, NCCL_DEBUG=WARN does not show any more useful information; it only reports that IB detection failed, and the hang happened only after training had been running for a long time (screenshot attached). Is this a bug in NCCL, or does inappropriate usage lead to this hang?

kwen2501 commented 5 years ago

Hi @wangsiyu, can you please try with NCCL 2.4.7 and see if it fixes the issue? Thanks!

Mykheievskyi commented 5 years ago

I have the same issue. @wangsiyu, did you solve it? Thanks in advance.

wangsiyu commented 5 years ago

I worked around this problem by setting NCCL_LL_THRESHOLD=0. It seems to be a bug in 2.4.2. I did not upgrade the NCCL version, but I think this bug has been fixed in 2.4.6 according to its release notes. Thanks very much!
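For reference, a minimal sketch of how that workaround could be applied programmatically, assuming direct use of the NCCL C API rather than a framework (the setup in this thread goes through TensorFlow, where the variable is normally just exported in the launch environment); initCommWithWorkaround is a hypothetical helper:

// Sketch, not the thread's actual setup: set NCCL_LL_THRESHOLD=0 before the
// communicator is created, assuming NCCL reads the variable during init.
#include <stdlib.h>
#include <nccl.h>

ncclComm_t initCommWithWorkaround(int nRanks, int myRank, ncclUniqueId id) {
  // Caller is assumed to have already selected the CUDA device for this rank.
  setenv("NCCL_LL_THRESHOLD", "0", /*overwrite=*/1);
  ncclComm_t comm;
  ncclCommInitRank(&comm, nRanks, id, myRank);
  return comm;
}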

372046933 commented 4 years ago

I can reproduce it on 2.4.7

sjeaugey commented 4 years ago

Can you explain in more detail what you could reproduce with 2.4.7? Do you have a hang which can be solved by setting NCCL_LL_THRESHOLD=0? Is it a hang showing one thread blocked in recv()?

Many things can cause hangs, so it might be good to open another issue to make sure it is the same problem. Also note that NCCL 2.5 is the latest stable version you might want to try, and a preview of NCCL 2.6 is available on the v2.6 branch: https://github.com/nvidia/nccl/tree/v2.6. Feel free to give it a try. Thanks!

372046933 commented 4 years ago

I used gdb to print the stack trace. It is blocked at:

#0 syscall ... x86_64/syscall.S:38
#1 nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec)
from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so

Sorry that I cannot paste the full trace. By the way, NCCL_LL_THRESHOLD=0 fixed the hang in the three-node setting; with 10 nodes it still fails occasionally.

sjeaugey commented 4 years ago

Can you describe the problem you are facing in a bit more detail? Do you have a hang? Does it reproduce every time, and does it happen immediately or after some time?

Also, pasting the NCCL debug output would help (at least set NCCL_DEBUG=WARN to double-check the NCCL version and catch any warnings).
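For example, a small hedged sketch of a per-rank version check, assuming direct access to the NCCL C API (printNcclVersion is a hypothetical helper); running it on every node makes a version mismatch easy to spot:

// Sketch: report the NCCL version each rank actually linked against.
#include <stdio.h>
#include <nccl.h>

void printNcclVersion(int rank) {
  int version = 0;
  ncclGetVersion(&version);  // integer code, e.g. 2407 for NCCL 2.4.7
  printf("rank %d: NCCL version %d\n", rank, version);
}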

372046933 commented 4 years ago

It always hangs during training, but at a different step each time. With NCCL_DEBUG=INFO I get the following messages before the hang.

[0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/SOCKET : message truncated : receiving 1048576 bytes instead of 32768
[0] NCCL INFO bazel-out/k8-py2-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:34 -> 3
[0] NCCL INFO external/nccl_archive/src/transport/net.cc:533 -> 3
[0] NCCL INFO external/nccl_archive/src/transport.cc:163 -> 3 [Proxy Thread]

kwen2501 commented 4 years ago

@372046933 Could you be using different versions of NCCL on different nodes, or have compiled NCCL from files of different versions? I just noticed that your error messages show different NCCL source paths. As mentioned by @sjeaugey, it would be good to confirm the NCCL version on the different nodes.

Another possibility is having set different values for some environment variables on different nodes. For example, one node could have NCCL_LL_THRESHOLD=0 set while the other nodes do not. Judging from the message-truncation warning you are getting, that is likely the cause: 1048576 (1 MB) is the message size used for the Simple algorithm (when NCCL_LL_THRESHOLD is forced to 0), whereas 32768 (32 KB) could be the message size used for the LL algorithm (when NCCL_LL_THRESHOLD is not set).
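One way to rule that out, sketched here under the assumption that the job also has MPI available (checkLLThresholdConsistent is a hypothetical helper), is to broadcast rank 0's NCCL_LL_THRESHOLD value and have every rank compare it against its own:

// Sketch: detect a per-node NCCL_LL_THRESHOLD mismatch before training starts.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

void checkLLThresholdConsistent() {
  char local[64] = "unset";
  const char* v = getenv("NCCL_LL_THRESHOLD");
  if (v) strncpy(local, v, sizeof(local) - 1);

  char root[64];
  memcpy(root, local, sizeof(root));
  MPI_Bcast(root, sizeof(root), MPI_CHAR, 0, MPI_COMM_WORLD);  // rank 0's value

  if (strcmp(local, root) != 0) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: NCCL_LL_THRESHOLD=%s differs from rank 0 (%s)\n", rank, local, root);
  }
}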

372046933 commented 4 years ago

Well, I run the script in Docker; the containers all use the same image and the same environment variables, set by Kubernetes. I have checked that the Docker image checksum is the same on all nodes. By the way, NCCL 2.4.7 was compiled together with TensorFlow 2.0.

kwen2501 commented 4 years ago

Thanks for the check. The next thing I would check is whether the ranks make the collective call with different sizes.
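A hedged sketch of such a check, assuming MPI is available alongside NCCL (countsMatchAcrossRanks is a hypothetical helper): each rank passes the element count it is about to give to the collective, and any mismatch is reported before the NCCL call is made:

// Sketch: confirm that all ranks are about to call the collective with the
// same element count; a mismatch can lead to exactly this kind of truncation.
#include <stdio.h>
#include <mpi.h>

int countsMatchAcrossRanks(long long count) {
  long long minCount = 0, maxCount = 0;
  MPI_Allreduce(&count, &minCount, 1, MPI_LONG_LONG, MPI_MIN, MPI_COMM_WORLD);
  MPI_Allreduce(&count, &maxCount, 1, MPI_LONG_LONG, MPI_MAX, MPI_COMM_WORLD);
  if (minCount != maxCount) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: size mismatch (min=%lld max=%lld mine=%lld)\n",
           rank, minCount, maxCount, count);
    return 0;
  }
  return 1;
}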