Closed — xqmmy closed this issue 1 year ago
I've tried a lot of approaches and none of them work.
Please follow the issue guidelines when asking questions. As above, you should tell me your PyTorch and CUDA versions; from a single error message it's hard to guess your exact setup.
Okay, sorry for not describing it clearly: torch 1.13.1, CUDA 11.7, using the full merged dataset and the original multi-GPU finetune script, on seven A4000 GPUs.
The PyTorch install command matching that configuration should be `pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117`. Check that you didn't accidentally install the CPU-only build.
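A quick way to rule out an accidental CPU-only build is to check what the installed torch reports from inside the training environment:

```bash
# A CUDA build should report a non-None CUDA version and see the GPUs
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"
# Expected here: 1.13.1+cu117 11.7 True 7  (a CPU build prints 1.13.1+cpu None False 0)
```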
Does it run successfully on a single GPU? For multi-GPU, try two-GPU and four-GPU configurations and see whether the problem persists (a launch sketch is below); there's a similar issue here.
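A sketch of scaling the launch up one step at a time, assuming a torchrun-style launcher (substitute however finetune.py is normally invoked, plus its usual arguments):

```bash
# 1 GPU: no NCCL collectives, isolates the script itself
CUDA_VISIBLE_DEVICES=0 python finetune.py

# 2 GPUs, then 4: find the smallest world size that reproduces the timeout
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 finetune.py
```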
You can add `export NCCL_DEBUG=INFO` when running to get more detailed error output. You could also check whether this helps.
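Concretely, something like the following before launching; `NCCL_DEBUG_SUBSYS` is optional and just narrows the output:

```bash
# Print NCCL's topology detection, transport selection, and errors to stderr
export NCCL_DEBUG=INFO
# Optionally restrict logging to the init and network subsystems
export NCCL_DEBUG_SUBSYS=INIT,NET
```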
I had the same problem: a timeout on a single machine with multiple GPUs. Changing the BIOS settings to disable ACS fixed it for me; you could try that.
For details, see: https://www.modb.pro/db/617940
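If you want to inspect the ACS state before touching the BIOS, the flags are visible from lspci (a `+` suffix means the capability is active):

```bash
# PCIe ACS forces peer-to-peer traffic through the root complex, which can
# stall NCCL's GPU-to-GPU paths; "+" flags in the output mean ACS is enabled
sudo lspci -vvv | grep -i ACSCtl
```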
Switching to a 3090 made the problem go away; no idea why.
Thanks for the answers, I'll keep trying.
The solution in https://github.com/NVIDIA/nccl/issues/426 works.
`export NCCL_IB_GID_INDEX=3` solved my problem.
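For anyone hitting this on a RoCE setup: GID index 3 is commonly the RoCE v2 entry, though the mapping varies per machine. A minimal sketch; `show_gids` assumes Mellanox OFED is installed:

```bash
# Pin NCCL to GID index 3 (commonly RoCE v2) before launching training
export NCCL_IB_GID_INDEX=3
# With Mellanox OFED installed, show_gids lists each port's GID table
show_gids
```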
[E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
localhost:10108:60893 [4] NCCL INFO [Service thread] Connection closed by localRank 4
localhost:10108:60278 [0] NCCL INFO comm 0x9ea4350 rank 4 nranks 7 cudaDev 4 busId a1000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
Traceback (most recent call last):
  File "finetune.py", line 271, in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/root/miniconda3/envs/vicuna/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801815 milliseconds before timing out.
localhost:10101:60894 [3] NCCL INFO [Service thread] Connection closed by localRank 3
localhost:10101:59694 [0] NCCL INFO comm 0xa8d5900 rank 3 nranks 7 cudaDev 3 busId 81000 - Abort COMPLETE
Traceback (most recent call last):
  File "finetune.py", line 271, in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)