NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

AI/ML training hangs up with no error report from NCCL #1111

Open yanminjia opened 6 months ago

yanminjia commented 6 months ago

Megatron AI/ML training hangs with the error messages below: a ReduceScatter failed to finish within the timeout (30 minutes). The tricky part is that NCCL reports no error log at all, so I have no idea how to debug this issue. Any clue would be highly appreciated. Many thanks for your time. @sjeaugey

[E ProcessGroupNCCL.cpp:481] [Rank 65] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1559, OpType=_REDUCE_SCATTER_BASE, NumelIn=3121908768, NumelOut=32519883, Timeout(ms)=1800000) ran for 1800441 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:481] [Rank 66] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1559, OpType=_REDUCE_SCATTER_BASE, NumelIn=3121908768, NumelOut=32519883, Timeout(ms)=1800000) ran for 1800779 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:481] [Rank 64] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1559, OpType=_REDUCE_SCATTER_BASE, NumelIn=3121908768, NumelOut=32519883, Timeout(ms)=1800000) ran for 1800780 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:481] [Rank 67] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1559, OpType=_REDUCE_SCATTER_BASE, NumelIn=3121908768, NumelOut=32519883, Timeout(ms)=1800000) ran for 1800792 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:495] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:501] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:935] [Rank 64] NCCL watchdog thread terminated with exception: [Rank 64] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1559, OpType=_REDUCE_SCATTER_BASE, NumelIn=3121908768, NumelOut=32519883, Timeout(ms)=1800000) ran for 1800780 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 64] NCCL watchdog thread terminated with exception: [Rank 64] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1559, OpType=_REDUCE_SCATTER_BASE, NumelIn=3121908768, NumelOut=32519883, Timeout(ms)=1800000) ran for 1800780 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:495] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:501] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:935] [Rank 67] NCCL watchdog thread terminated with exception: [Rank 67] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1559, OpType=_REDUCE_SCATTER_BASE, NumelIn=3121908768, NumelOut=32519883, Timeout(ms)=1800000) ran for 1800792 milliseconds before timing out
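
A common first step for hangs like this, assuming a PyTorch + NCCL setup such as Megatron's with control over the launch environment, is to turn on NCCL's own debug logging and give the process group a longer timeout so logs can be collected before the watchdog tears the job down. The sketch below is illustrative only: the log path, subsystem list, and timeout value are placeholders, not the configuration used in this report.

```python
# Illustrative sketch (placeholder values): enable verbose NCCL logging and a
# longer collective timeout before the process group is created.
# Assumes a torchrun-style launch that sets RANK/WORLD_SIZE/MASTER_ADDR.
import datetime
import os

import torch.distributed as dist

# NCCL's own debug output; a per-host/per-pid file keeps ranks separated.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL,NET"
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl_debug.%h.%p.log"  # %h=host, %p=pid

# Extra PyTorch-side checks for mismatched or stuck collectives.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

# Give slow-but-progressing collectives more headroom than the default
# 30-minute watchdog timeout while the hang is being investigated.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```

Raising the timeout only buys time to gather information; the underlying cause (a straggling rank, a desynchronized collective, or a network problem) still has to be found in the per-rank NCCL debug output.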

jeason-zhou1 commented 5 months ago

Hello, may I ask whether this problem has been resolved? I have encountered a similar problem; my program blocks here: WorkNCCL(OpType=ALLGATHER_BASE, Timeout(ms)=1800000)

wwj-2017-1117 commented 5 months ago

I have also encountered a similar problem: RuntimeError: NCCL communicator was aborted on rank 5. Original reason for failure was: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7492681, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802521 milliseconds before timing out.

bingnandu commented 1 month ago

Has this problem been resolved? I have also encountered a similar problem: NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.

kiskra-nvidia commented 1 month ago

These are extremely generic reports that don't contain enough information for us to even begin trying to diagnose what the issue(s) might be.
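
Beyond the NCCL debug logs described above, a view of where each rank is actually stuck usually makes such reports actionable. One generic way to get that, independent of NCCL or PyTorch, is to register a handler that dumps every Python thread's stack on demand; the sketch below uses only the standard library, and the signal choice and log path are placeholders.

```python
# Sketch (placeholder path/signal): dump all Python thread stacks on SIGUSR1,
# so a rank that is stuck inside a collective can be inspected with
# `kill -USR1 <pid>` while the job is hanging.
import faulthandler
import os
import signal


def install_hang_dumper(path="/tmp/rank_stacks.log"):
    log_file = open(path, "a")  # append so repeated dumps are kept
    # On SIGUSR1, write a traceback for every Python thread in this process.
    faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
    # Also dump automatically on hard crashes (segfaults, aborts).
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file  # keep the file object alive


if __name__ == "__main__":
    install_hang_dumper()
    print(f"hang dumper installed for pid {os.getpid()}")
```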