zwqjoy opened this issue 4 years ago
Thanks for the report. It seems you are running NCCL 2.4.7 compiled in a strange way, where CUDA_MAJOR/CUDA_MINOR were not replaced.
Can you tell us more about the environment in which you are running? For example, where this TensorFlow version comes from, how NCCL was compiled, and on which platform you are executing?
That would help us reproduce and understand the issue. It might also help us figure out how you could try a newer version, like 2.4.8 or 2.5.6.
I ran into the same problem with the official TensorFlow 2.0.1 Docker image.
This looks like a setup issue, e.g. different ranks not calling NCCL consistently.
I would suggest filing the issue against TensorFlow and seeing if they have advice on what could be wrong.
I also hit the same issue. My compute environment is: NCCL version 2.10.3+cuda11.1, Ubuntu 20.04. I am training a model with PyTorch across multiple machines and multiple GPUs. The error is as follows:
transport/net_socket.cc:424 NCCL WARN NET/Socket : peer 192.168.161.30<59210> message truncated :
receiving 1048576 bytes instead of 65536
I don't know how to solve it.
Can you set NCCL_PROTO=SIMPLE and see if the problem still happens?
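In case it helps, here is a minimal sketch of setting NCCL_PROTO inside the training script itself, so every rank sees the same value regardless of each node's shell environment. It assumes a PyTorch DDP job launched so that MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are provided; adjust to your launcher.

# Sketch only: assumes a PyTorch DDP job where the launcher provides
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE in the environment.
import os

# Set NCCL variables in-script, before the NCCL communicator is created,
# so all ranks use the same protocol.
os.environ["NCCL_PROTO"] = "SIMPLE"
os.environ["NCCL_DEBUG"] = "INFO"

import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()}: NCCL_PROTO={os.environ.get('NCCL_PROTO')}")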
@sjeaugey, thanks for your reply. The issue is still not solved with your suggestion.
By the way, I found a case that does work:
If I only reduce the training data size (from 150 hours to 2 minutes), distributed training works.
My NCCL environment is as follows:
export NCCL_SOCKET_IFNAME="eno1np0"
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1 # because I don't have IB
export NCCL_DEBUG_SUBSYS=ENV
However, when the data is large (e.g. 150 hours), distributed training does not work.
I hope you can help me. Thanks a lot.
Sorry for the delay.
This could be due to a couple of things, the most likely being an inconsistent NCCL_PROTO=SIMPLE setting. This can happen if different environment variables are set on the different nodes, causing NCCL to choose the LL protocol (which has a 64K chunk size) on one node and the Simple protocol (which has a 1MB chunk size) on another node. Given that those are exactly the numbers reported in your log, this is the most probable cause. Are you sure NCCL_PROTO was set on all ranks? Are you getting the exact same error message, with the same sizes (65536 and 1048576)?
Hope this helps.
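As a quick consistency check, here is a sketch (it assumes the PyTorch process group is already initialized with the NCCL backend, and all_gather_object requires PyTorch >= 1.8) that gathers each rank's view of the NCCL-related environment variables, so a node where NCCL_PROTO differs becomes easy to spot:

import os
import torch.distributed as dist

# Sketch: call this after dist.init_process_group(backend="nccl").
# Each rank reports the NCCL env vars it actually sees; rank 0 prints them
# so mismatches between nodes stand out.
def report_nccl_env():
    keys = ["NCCL_PROTO", "NCCL_SOCKET_IFNAME", "NCCL_IB_DISABLE", "NCCL_DEBUG"]
    local = {k: os.environ.get(k) for k in keys}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local)
    if dist.get_rank() == 0:
        for rank, env in enumerate(gathered):
            print(f"rank {rank}: {env}")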
When using NCCL training:
NCCL version 2.4.7+cudaCUDA_MAJOR.CUDA_MINOR hvd1:2918:3333 [0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/Socket : message truncated : receiving 696320 bytes instead of 32768