NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL WARN NET/Socket : message truncated #268

Open zwqjoy opened 4 years ago

zwqjoy commented 4 years ago

When training with NCCL, I get:

NCCL version 2.4.7+cudaCUDA_MAJOR.CUDA_MINOR
hvd1:2918:3333 [0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/Socket : message truncated : receiving 696320 bytes instead of 32768

WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
WARNING:tensorflow:ModelCheckpoint callback is not provided. Workers will need to restart training if any fails.
WARNING:tensorflow:ModelCheckpoint callback is not provided. Workers will need to restart training if any fails.
Train for 937 steps
Epoch 1/3
2019-11-28 16:59:29.962581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-28 16:59:31.771855: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
**NCCL version 2.4.7+cudaCUDA_MAJOR.CUDA_MINOR**
hvd1:2918:3332 [0] NCCL INFO Setting affinity for GPU 0 to ffffff,ffffffff
hvd1:2918:3332 [0] NCCL INFO Could not find real path of /sys/class/net/eth0/device
hvd1:2918:3332 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:24 -> 2
hvd1:2918:3332 [0] NCCL INFO CUDA Dev 0[3], Socket NIC distance :  SYS
hvd1:2918:3332 [0] NCCL INFO Channel 00 :    0   1
hvd1:2918:3332 [0] NCCL INFO Could not find real path of /sys/class/net/eth0/device
hvd1:2918:3332 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:24 -> 2
hvd1:2918:3332 [0] NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
hvd1:2918:3332 [0] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
hvd1:2918:3332 [0] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
hvd1:2918:3332 [0] NCCL INFO comm 0x7fbf380021e0 rank 0 nranks 2 cudaDev 0 nvmlDev 3 - Init COMPLETE
hvd1:2918:3331 [0] NCCL INFO Launch mode Parallel
 17/937 [..............................] - ETA: 6:57 - loss: 5.9124 - sparse_categorical_accuracy: 0.0358   
**hvd1:2918:3333 [0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/Socket : message truncated : receiving 696320 bytes instead of 32768**
hvd1:2918:3333 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:34 -> 3
hvd1:2918:3333 [0] NCCL INFO external/nccl_archive/src/transport/net.cc:533 -> 3
hvd1:2918:3333 [0] NCCL INFO external/nccl_archive/src/transport.cc:163 -> 3 [Proxy Thread]
sjeaugey commented 4 years ago

Thanks for the report. It seems you are running NCCL 2.4.7 built in an unusual way, since the CUDA_MAJOR/CUDA_MINOR placeholders were not replaced.

Can you tell us more about the environment you are running in? For example, where this TensorFlow build comes from, how NCCL was compiled, and on which platform you are executing?

That would help us reproduce and understand the issue, and might also help us figure out how to try a newer version such as 2.4.8 or 2.5.6.

372046933 commented 4 years ago

I ran into the same problem with the official TensorFlow 2.0.1 Docker image.

sjeaugey commented 4 years ago

This looks like a setup issue, e.g. different ranks not calling NCCL consistently.

I would suggest submitting the issue to TensorFlow and seeing if they have advice on what could be wrong.
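
To illustrate what "not calling NCCL consistently" can look like, here is a minimal PyTorch sketch (illustrative only, not taken from this thread): if two ranks pass different tensor sizes to the same collective, the receiving side posts a smaller buffer than the message that arrives, which is the "message truncated" symptom on the socket transport. The sizes below are chosen only to echo the byte counts in the log above.

```python
# Hypothetical repro sketch (assumption, not from this issue).
# Launch with e.g.: torchrun --nproc_per_node=2 repro.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # one NCCL communicator per process
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # BUG on purpose: the two ranks disagree on the element count for the same
    # collective (8192 floats = 32768 bytes vs 174080 floats = 696320 bytes).
    numel = 8192 if rank == 0 else 174080
    t = torch.ones(numel, device="cuda")
    dist.all_reduce(t)                           # inconsistent sizes -> undefined behavior,
                                                 # often seen as hangs or truncation warnings
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```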

shanguanma commented 2 years ago

I also hit the same issue. My compute environment is as follows: NCCL version 2.10.3+cuda11.1, Ubuntu 20.04, training a PyTorch model across multiple machines and multiple GPUs. The details are as follows:

transport/net_socket.cc:424 NCCL WARN NET/Socket : peer 192.168.161.30<59210> message truncated :
receiving 1048576 bytes instead of 65536

I don't know how to solve it.

sjeaugey commented 2 years ago

Can you set NCCL_PROTO=SIMPLE and see if the problem still happens?
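
For reference, NCCL reads the variable when the communicator is created, so it just has to be in the environment before initialization; you can export it in the job script, or set it from Python before init (a minimal sketch, assuming a PyTorch-style setup):

```python
# Must run before the first NCCL communicator is created (assumption: PyTorch job).
import os
os.environ["NCCL_PROTO"] = "SIMPLE"       # force the Simple protocol instead of LL/LL128

import torch.distributed as dist
dist.init_process_group(backend="nccl")   # NCCL picks up NCCL_PROTO here
```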

shanguanma commented 2 years ago

@sjeaugey, thanks for your reply. Your suggestion did not solve the issue.

By the way, I found one case that works: if I reduce the training data scale (from 150 hours to 2 minutes), distributed training works.
My NCCL environment is as follows:

export NCCL_SOCKET_IFNAME="eno1np0"
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # because I don't have IB
export NCCL_DEBUG_SUBSYS=ENV

However, when the data is big (e.g. 150 hours), distributed training does not work.

I hope you can help me. Thanks a lot.
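
Since it works with 2 minutes of data but fails with 150 hours, one thing worth verifying is that every rank runs the same number of steps and passes the same tensor sizes to each collective; uneven data sharding can make one rank fall out of step with the others. A minimal sketch of such a per-rank consistency check, assuming a torchrun-launched PyTorch job (the helper name is illustrative):

```python
# Hypothetical consistency check: before the real collective, gather every
# rank's intended element count and assert they all match. A mismatch here is
# the kind of inconsistent NCCL usage that can surface as "message truncated".
import torch
import torch.distributed as dist

def assert_consistent_numel(tensor: torch.Tensor) -> None:
    world_size = dist.get_world_size()
    local = torch.tensor([tensor.numel()], device=tensor.device, dtype=torch.long)
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)          # cheap: one int64 per rank
    sizes = [int(g.item()) for g in gathered]
    assert len(set(sizes)) == 1, f"ranks disagree on collective size: {sizes}"

# Usage inside the training loop, before e.g. dist.all_reduce(grads):
#     assert_consistent_numel(grads)
```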

sjeaugey commented 2 years ago

Sorry for the delay.

This could be due to one of two things:

Hope this helps.