[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out.
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations
might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the
entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=778778, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800413 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=778778, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800413 milliseconds before timing out.
While running stage-1 training of the lightweight model you provided, I hit this NCCL timeout error right after the first epoch finished. Is there any way to fix it? I am training on eight 4090s, with batchsize_per_device set to 2 for both training and testing.
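For what it's worth, the log shows rank 1's ALLREDUCE hitting PyTorch's default 30-minute NCCL timeout (Timeout(ms)=1800000), which often means one rank is stuck or much slower than the others at an epoch boundary (e.g. during validation or checkpointing on rank 0 while other ranks wait). A common first step is to turn on NCCL diagnostics and make the failure surface cleanly instead of hanging; the sketch below is a config fragment, assuming a torchrun launch and a hypothetical train.py entry script:

```shell
# Verbose NCCL logging to locate which collective/rank stalls (assumption:
# you can rerun the failing job; the script name train.py is a placeholder).
export NCCL_DEBUG=INFO

# Fail fast with a Python-level error instead of a hard watchdog abort.
# On recent PyTorch the variable is TORCH_NCCL_* ; older releases used
# NCCL_ASYNC_ERROR_HANDLING / NCCL_BLOCKING_WAIT — set both to be safe.
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_ASYNC_ERROR_HANDLING=1

torchrun --nproc_per_node=8 train.py
```

If the stall is just a genuinely long per-rank step (e.g. single-rank evaluation at epoch end), you can also raise the timeout itself by passing `timeout=datetime.timedelta(hours=...)` to `torch.distributed.init_process_group`, though that hides rather than fixes a real desynchronization, so it's worth confirming with the `NCCL_DEBUG=INFO` output first.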