Open wd255 opened 3 months ago
Most of the messages you included come from PyTorch; as far as I can see, none of them are generated by NCCL. PyTorch simply reports that a NCCL collective operation has timed out. That could be for any number of reasons, from the timeout being too short, through an application error, a bug in NCCL, system misconfiguration, to hardware issues.
You may want to try running your app with the NCCL_DEBUG=WARN environment variable set; that way, if NCCL encounters an issue, it will print an error message. But given that we don't see PyTorch reporting any errors from NCCL, I wouldn't count on getting anything that way.
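As a rough illustration of the NCCL_DEBUG suggestion, the variable can be added to the environment a training job is launched with. This is only a sketch: the helper name `with_nccl_debug`, the script name `train.py`, and the process count are assumptions for illustration, not something taken from the original report.

```python
import os


def with_nccl_debug(env=None, level="WARN"):
    """Return a copy of `env` (default: the current environment) with NCCL_DEBUG set."""
    merged = dict(os.environ if env is None else env)
    merged["NCCL_DEBUG"] = level
    return merged


# Hypothetical launch; train.py and the process count are placeholders:
# import subprocess
# subprocess.run(
#     ["torchrun", "--nproc_per_node=2", "train.py"],
#     env=with_nccl_debug(),
#     check=True,
# )
```

Setting the variable in the shell before starting the job has the same effect; the point is that it must be in place before NCCL initializes.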
Does NCCL run on that system at all? E.g., have you been able to run some of the NCCL tests, like all_reduce_perf, to successful completion?
Thanks! I increased the timeout and will give it another try with NCCL_DEBUG=WARN. I'll also run the NCCL tests. It's a fresh machine where NCCL is newly installed, so maybe there's something wrong with the NCCL setup?
I'm running into the same problem.
My suggestion is to try setting NCCL's environment variables. For example, export NCCL_SOCKET_IFNAME=eth0 (where eth0 is the name of the network interface bound to the IP address you want to use, as shown by ifconfig) and export NCCL_P2P_DISABLE=1 (setting this solved the problem for me). If it still doesn't work, you can try the environment variable settings mentioned in other answers.
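If exporting the variables in the shell is inconvenient, the same settings can be applied from Python, as long as this happens before the process group is created. A minimal sketch, assuming the interface really is called eth0 on your machine; the helper name `apply_nccl_workarounds` is hypothetical:

```python
import os


def apply_nccl_workarounds(environ=os.environ, ifname="eth0", disable_p2p=True):
    """Set the NCCL variables suggested above on `environ` and return it.

    Call this before torch.distributed / accelerate initializes NCCL,
    since NCCL reads these variables when it starts up.
    """
    environ["NCCL_SOCKET_IFNAME"] = ifname  # interface name is an assumption
    if disable_p2p:
        environ["NCCL_P2P_DISABLE"] = "1"  # turn off direct GPU peer-to-peer transport
    return environ
```

Disabling P2P can cost performance, so it is probably best treated as a diagnostic step and not a permanent setting.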
Hi, how do you increase the timeout?
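For reference, one common way to raise the timeout is the `timeout` argument of `torch.distributed.init_process_group`, or, when training through accelerate, an `InitProcessGroupKwargs` handler passed to `Accelerator`. The sketch below is not necessarily what was done in this thread; the helper names and the 30-minute value are assumptions chosen for illustration.

```python
from datetime import timedelta


def collective_timeout(minutes=30):
    """Build the timedelta used as the collective timeout (30 minutes is an arbitrary choice)."""
    return timedelta(minutes=minutes)


def init_nccl_process_group(minutes=30):
    """Plain PyTorch: pass the timeout when creating the process group."""
    import torch.distributed as dist  # imported lazily; requires PyTorch

    dist.init_process_group(backend="nccl", timeout=collective_timeout(minutes))


def make_accelerator(minutes=30):
    """accelerate: pass the timeout through an InitProcessGroupKwargs handler."""
    from accelerate import Accelerator  # imported lazily; requires accelerate
    from accelerate.utils import InitProcessGroupKwargs

    handler = InitProcessGroupKwargs(timeout=collective_timeout(minutes))
    return Accelerator(kwargs_handlers=[handler])
```

A longer timeout only helps if the collective is genuinely slow; if one rank never reaches the collective at all, the job will still hang, just for longer.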
Hi, I'm having this issue: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078...) ran for 600026 milliseconds before timing out
The code I'm running is a VQGAN training script, parallelized with accelerate. The issue happens at the end of each epoch. We believe the problem lies in the environment setup, since the same code runs on another machine without any problem.
environment:
Ubuntu 20
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:34:21_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0
Can anyone help me with it?