Some NCCL operations have failed or timed out.

dbcSep03 commented 6 months ago

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=_ALLGATHER_BASE, NumelIn=7168, NumelOut=14336, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e2ae5781d87 in /home/dongbingcheng/anaconda3/envs/llmfinetuning/lib/python3.9/site-packages/torch/lib/libc10.so)
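For anyone reading the trace: the watchdog line already says which collective stalled and for how long. Below is a minimal sketch that pulls those fields out with a regular expression. The helper name `parse_watchdog_timeout` and the pattern are illustrative assumptions, not part of PyTorch; only the log text itself comes from the trace above.

```python
import re

# Watchdog line copied from the trace above.
LOG_LINE = (
    "[Rank 1] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=61, OpType=_ALLGATHER_BASE, NumelIn=7168, "
    "NumelOut=14336, Timeout(ms)=1800000) ran for 1800086 milliseconds "
    "before timing out."
)

# Hypothetical pattern written for this one log format; other PyTorch
# versions may print the fields differently.
PATTERN = re.compile(
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<ran_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the WorkNCCL fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation type.
    info = {k: (v if k == "op" else int(v)) for k, v in match.groupdict().items()}
    # Convert the configured timeout from milliseconds to minutes.
    info["timeout_minutes"] = info["timeout_ms"] / 60_000
    return info


info = parse_watchdog_timeout(LOG_LINE)
print(info["op"], info["seq"], info["timeout_minutes"])
```

Read this way, rank 1 waited the full 1800000 ms (30 minutes) inside an all-gather with sequence number 61, which suggests the other rank never entered the same collective during that window.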

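I'm training on two GPUs, and the error seems to appear right after the first epoch finishes. I'm using the provided train.py. Could it be that, during evaluation, the other process has not finished yet? Would adding accelerator.wait_for_everyone() fix it? Thanks for any help!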
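A rough sketch of what the barrier idea could look like, assuming `accelerator` is a Hugging Face Accelerate `Accelerator` and that evaluation is meant to run on the main process only. The helper `run_eval_then_sync` and the `eval_fn` argument are hypothetical names for illustration; how train.py actually structures its evaluation is not shown in this thread.

```python
def run_eval_then_sync(accelerator, eval_fn):
    """Run eval_fn on the main process only, then hold every rank at a barrier.

    Assumption: eval_fn does not call any collective operation itself
    (gather, all-reduce, ...). If it did, running it on a single rank
    would leave that collective waiting for the other ranks.
    """
    result = None
    if accelerator.is_main_process:
        result = eval_fn()
    # Every rank must reach this call. Ranks that skipped the evaluation
    # wait here until the main process arrives.
    accelerator.wait_for_everyone()
    return result


# Hypothetical usage inside a training loop:
#   from accelerate import Accelerator
#   accelerator = Accelerator()
#   metrics = run_eval_then_sync(accelerator, lambda: evaluate(model, eval_loader))
```

The barrier is itself a collective, so if the evaluation on the main process runs longer than the 30-minute timeout shown in the log, the waiting rank may still time out. In that case the barrier alone would probably not be enough.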

dbcSep03 commented 6 months ago

This is from the wandb log. You can see that at the last step the values also differ quite a bit. How can I fix this?