zwqjoy opened this issue 4 years ago
Thanks for the report. It seems you are running NCCL 2.4.7 compiled in a strange way, where CUDA_MAJOR/CUDA_MINOR were not replaced.
Can you tell us more about the environment in which you are running? For example, where this TensorFlow version comes from, how NCCL was compiled, and on which platform you are executing?
That would help us reproduce and understand the issue. It might also help us figure out how you could try a newer version, like 2.4.8 or 2.5.6.
I ran into the same problem with the official TensorFlow 2.0.1 Docker image.
This looks like a setup issue, e.g. different ranks not calling NCCL consistently.
I would suggest filing the issue against TensorFlow and seeing if they have advice on what could be wrong.
I also hit the same issue. My compute environment is: NCCL version 2.10.3+cuda11.1, Ubuntu 20.04. I am training a model with PyTorch across multiple machines and multiple GPUs. The error is as follows:
transport/net_socket.cc:424 NCCL WARN NET/Socket : peer 192.168.161.30<59210> message truncated :
receiving 1048576 bytes instead of 65536
I don't know how to solve it.
Can you set NCCL_PROTO=SIMPLE and see if the problem still happens?
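In case it helps, here is a minimal sketch of setting NCCL_PROTO inside the training script itself, so every rank sees the same value regardless of each node's shell environment. It assumes a PyTorch DDP job launched so that MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are provided; adjust to your launcher.

# Sketch only: assumes a PyTorch DDP job where the launcher provides
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE in the environment.
import os

# Set NCCL variables in-script, before the NCCL communicator is created,
# so all ranks use the same protocol.
os.environ["NCCL_PROTO"] = "SIMPLE"
os.environ["NCCL_DEBUG"] = "INFO"

import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()}: NCCL_PROTO={os.environ.get('NCCL_PROTO')}")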
@sjeaugey, thanks for your reply. The issue is still not solved with your suggestion.
By the way, I found a case that does work:
If I only reduce the training data size (from 150 hours to 2 minutes), distributed training works.
My NCCL environment is as follows:
export NCCL_SOCKET_IFNAME="eno1np0"
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1 # because I don't have IB
export NCCL_DEBUG_SUBSYS=ENV
However, when the data is large (e.g. 150 hours), distributed training does not work.
I hope you can help me. Thanks a lot.
Sorry for the delay.
This could be due to a couple of things, the most likely being an inconsistent NCCL_PROTO=SIMPLE setting. This can happen if different environment variables are set on the different nodes, causing NCCL to choose the LL protocol (which has a 64K chunk size) on one node and the Simple protocol (which has a 1MB chunk size) on another node. Given that those are exactly the numbers reported in your log, this is the most probable cause. Are you sure NCCL_PROTO was set on all ranks? Are you getting the exact same error message, with the same sizes (65536 and 1048576)?
Hope this helps.
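As a quick consistency check, here is a sketch (it assumes the PyTorch process group is already initialized with the NCCL backend, and all_gather_object requires PyTorch >= 1.8) that gathers each rank's view of the NCCL-related environment variables, so a node where NCCL_PROTO differs becomes easy to spot:

import os
import torch.distributed as dist

# Sketch: call this after dist.init_process_group(backend="nccl").
# Each rank reports the NCCL env vars it actually sees; rank 0 prints them
# so mismatches between nodes stand out.
def report_nccl_env():
    keys = ["NCCL_PROTO", "NCCL_SOCKET_IFNAME", "NCCL_IB_DISABLE", "NCCL_DEBUG"]
    local = {k: os.environ.get(k) for k in keys}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local)
    if dist.get_rank() == 0:
        for rank, env in enumerate(gathered):
            print(f"rank {rank}: {env}")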
When using NCCL training:
NCCL version 2.4.7+cudaCUDA_MAJOR.CUDA_MINOR hvd1:2918:3333 [0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/Socket : message truncated : receiving 696320 bytes instead of 32768