Open yanminjia opened 6 months ago
Hello, may I ask if the problem has been resolved? I have also encountered a similar problem. My program is blocking here: WorkNCCL (OpType=ALLGATHER-BASE, Timeout (ms)=1800000)
I have also encountered a similar problem: RuntimeError: NCCL communicator was aborted on rank 5. Original reason for failure was: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7492681, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802521 milliseconds before timing out.
Has the problem been resolved? I have also encountered a similar problem: NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
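When several ranks report timeouts like the ones quoted above, it can help to pull out the rank, SeqNum, and OpType from each message and compare them side by side. The following is only a rough sketch: the regular expression is written against the message text quoted in this thread, the helper name `parse_watchdog_timeout` is made up for illustration, and other PyTorch versions may format the message differently.

```python
import re

# Pattern written against the watchdog messages quoted in this thread.
# Assumption: other PyTorch versions may word the message differently.
_TIMEOUT_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\((?P<fields>.*?)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return rank, elapsed time, and WorkNCCL fields, or None if no match.

    Hypothetical helper for illustration; not part of PyTorch or NCCL.
    """
    match = _TIMEOUT_RE.search(line)
    if match is None:
        return None
    fields = {}
    # Fields look like "SeqNum=16, OpType=ALLREDUCE, Timeout(ms)=600000".
    for item in match.group("fields").split(", "):
        key, _, value = item.partition("=")
        fields[key] = value
    return {
        "rank": int(match.group("rank")),
        "elapsed_ms": int(match.group("elapsed_ms")),
        "fields": fields,
    }


if __name__ == "__main__":
    sample = (
        "[Rank 0] Watchdog caught collective operation timeout: "
        "WorkNCCL(SeqNum=16, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, "
        "Timeout(ms)=600000) ran for 600073 milliseconds before timing out."
    )
    print(parse_watchdog_timeout(sample))
```

If the ranks disagree on SeqNum or OpType, that may point to ranks issuing collectives in a different order, but the message alone does not confirm the cause.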
These are extremely generic reports that don't contain enough information for us to even begin trying to diagnose what the issue(s) might be.
Please rerun with NCCL_DEBUG=WARN. If possible, NCCL_DEBUG=INFO would be better (but the output will be much larger).
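One way to collect that output, sketched in Python: build the environment for the training job with NCCL_DEBUG set, and optionally send the logs to one file per host and process through NCCL_DEBUG_FILE. The helper name `nccl_debug_env`, the log file name, and the launch command in the comment are placeholders for illustration, not something taken from this thread.

```python
import os

# Values accepted here; NCCL_DEBUG also documents VERSION and TRACE.
_VALID_LEVELS = {"VERSION", "WARN", "INFO", "TRACE"}


def nccl_debug_env(level="WARN", log_file=None, base_env=None):
    """Return a copy of the environment with NCCL debug logging enabled.

    Hypothetical helper for illustration; not part of NCCL or PyTorch.
    """
    if level not in _VALID_LEVELS:
        raise ValueError(f"unsupported NCCL_DEBUG level: {level!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    if log_file is not None:
        # NCCL expands %h to the hostname and %p to the process ID,
        # which keeps the (large) INFO output separate per rank.
        env["NCCL_DEBUG_FILE"] = log_file
    return env


if __name__ == "__main__":
    env = nccl_debug_env("INFO", log_file="nccl_debug.%h.%p.log")
    print(env["NCCL_DEBUG"], env["NCCL_DEBUG_FILE"])
    # Placeholder launch command; substitute your own launcher and script:
    # subprocess.run(["torchrun", "--nproc_per_node=8", "train.py"],
    #                env=env, check=True)
```

The variables must be set in the environment of every rank before the NCCL communicator is created, so on multi-node jobs they need to be exported through whatever launcher starts the remote processes.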
Megatron AI/ML training hangs with the following error messages. ReduceScatter failed to finish within the timeout (30 minutes). The tricky part is that NCCL reports no error log. I have no idea how to debug this issue. Any clue would be highly appreciated. Many thanks for your time. @sjeaugey