wd255 opened 1 month ago
Most of the messages you included come from PyTorch; as far as I can see, none of them are generated by NCCL. PyTorch simply reports that an NCCL collective operation has timed out. That could happen for any number of reasons: the timeout being too short, an application error, a bug in NCCL, system misconfiguration, or hardware issues.
You may want to try running your app with the NCCL_DEBUG=WARN environment variable set -- that way, if NCCL encounters an issue, it will print an error message. But, given that we don't see PyTorch reporting any errors from NCCL, I wouldn't count on getting anything that way.
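For example (the script name and `torchrun` launch arguments below are placeholders; the variable works the same with any launcher):

```shell
# Ask NCCL to print warnings and errors to stderr.
# NCCL_DEBUG=INFO is even more verbose if WARN shows nothing.
# "train.py" stands in for the actual training script.
NCCL_DEBUG=WARN torchrun --nproc_per_node=8 train.py
```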
Does NCCL run on that system at all? E.g., have you been able to run some of the NCCL tests, like all_reduce_perf, to a successful completion?
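Something like the following, assuming CUDA and NCCL are installed in their default locations (adjust the GPU count to the machine):

```shell
# Build and run the NCCL performance tests from https://github.com/NVIDIA/nccl-tests
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests
make   # pass NCCL_HOME=/path/to/nccl if NCCL is not in a default location

# All-reduce from 8 bytes to 128 MB, doubling each step, on 8 local GPUs:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

If this hangs or errors out, the problem is in the NCCL/driver/fabric setup rather than in the training code.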
Thanks! I increased the timeout and will give it another try with NCCL_DEBUG=WARN. I'll also run the NCCL tests. It's a fresh machine where NCCL was newly installed, so maybe something is wrong with the NCCL setup?
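For reference, the 600026 ms in the log matches PyTorch's default 10-minute NCCL timeout. Since the training script uses accelerate, one way to raise it is via `InitProcessGroupKwargs`; a minimal sketch (the 2-hour value is an arbitrary example, not a recommendation):

```python
from datetime import timedelta

# With Hugging Face accelerate, the NCCL timeout is set when the process
# group is created, e.g.:
#
#   from accelerate import Accelerator
#   from accelerate.utils import InitProcessGroupKwargs
#
#   kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
#   accelerator = Accelerator(kwargs_handlers=[kwargs])
#
# The same timedelta can be passed to torch.distributed.init_process_group()
# via its timeout= parameter when not using accelerate.

timeout = timedelta(hours=2)
print(int(timeout.total_seconds() * 1000))  # timeout in milliseconds
```

Note that a longer timeout only hides the symptom if one rank is genuinely stuck; it mainly helps when a collective is legitimately slow (e.g., end-of-epoch checkpointing on rank 0).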
Hi, I'm having this issue: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078...) ran for 600026 milliseconds before timing out
The code I'm running is a VQGAN training script. Parallelism is done with accelerate. The issue happens at the end of each epoch. We believe it's a problem with the environment setup, since the same code runs on another machine without any issues.
environment: Ubuntu 20
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:34:21_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0
full log
Can anyone help me with it?