Closed — xqmmy closed this issue 1 year ago
I've tried a lot of approaches and none of them work.
Please follow the issue guidelines when asking questions. As above, you should tell me your PyTorch and CUDA versions; from a single error message it's hard to guess your exact setup.
Okay, sorry for not describing it clearly: torch 1.13.1, CUDA 11.7, using the full merged dataset and the original multi-GPU finetune script, on seven A4000 GPUs.
The PyTorch install command matching that configuration should be `pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117`. Check that you didn't accidentally install the CPU-only build.
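A quick way to rule out an accidental CPU-only build is to check what the installed torch reports from inside the training environment:

```bash
# A CUDA build should report a non-None CUDA version and see the GPUs
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"
# Expected here: 1.13.1+cu117 11.7 True 7  (a CPU build prints 1.13.1+cpu None False 0)
```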
Does it run successfully on a single GPU? For multi-GPU, try two-GPU and four-GPU configurations and see whether the problem persists (a launch sketch is below); there's a similar issue here.
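A sketch of scaling the launch up one step at a time, assuming a torchrun-style launcher (substitute however finetune.py is normally invoked, plus its usual arguments):

```bash
# 1 GPU: no NCCL collectives, isolates the script itself
CUDA_VISIBLE_DEVICES=0 python finetune.py

# 2 GPUs, then 4: find the smallest world size that reproduces the timeout
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 finetune.py
```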
You can add `export NCCL_DEBUG=INFO` when running to get more detailed error output. You could also check whether this helps.
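Concretely, something like the following before launching; `NCCL_DEBUG_SUBSYS` is optional and just narrows the output:

```bash
# Print NCCL's topology detection, transport selection, and errors to stderr
export NCCL_DEBUG=INFO
# Optionally restrict logging to the init and network subsystems
export NCCL_DEBUG_SUBSYS=INIT,NET
```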
I had the same problem: a timeout on a single machine with multiple GPUs. Changing the BIOS settings to disable ACS fixed it for me; you could try that.
For details, see: https://www.modb.pro/db/617940
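If you want to inspect the ACS state before touching the BIOS, the flags are visible from lspci (a `+` suffix means the capability is active):

```bash
# PCIe ACS forces peer-to-peer traffic through the root complex, which can
# stall NCCL's GPU-to-GPU paths; "+" flags in the output mean ACS is enabled
sudo lspci -vvv | grep -i ACSCtl
```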
Switching to a 3090 made the problem go away; no idea why.
Thanks for the answers, I'll keep trying.
The solution in https://github.com/NVIDIA/nccl/issues/426 works.
`export NCCL_IB_GID_INDEX=3` solved my problem.
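For anyone hitting this on a RoCE setup: GID index 3 is commonly the RoCE v2 entry, though the mapping varies per machine. A minimal sketch; `show_gids` assumes Mellanox OFED is installed:

```bash
# Pin NCCL to GID index 3 (commonly RoCE v2) before launching training
export NCCL_IB_GID_INDEX=3
# With Mellanox OFED installed, show_gids lists each port's GID table
show_gids
```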
[E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
localhost:10108:60893 [4] NCCL INFO [Service thread] Connection closed by localRank 4
localhost:10108:60278 [0] NCCL INFO comm 0x9ea4350 rank 4 nranks 7 cudaDev 4 busId a1000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
Traceback (most recent call last):
  File "finetune.py", line 271, in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/root/miniconda3/envs/vicuna/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801815 milliseconds before timing out.
localhost:10101:60894 [3] NCCL INFO [Service thread] Connection closed by localRank 3
localhost:10101:59694 [0] NCCL INFO comm 0xa8d5900 rank 3 nranks 7 cudaDev 3 busId 81000 - Abort COMPLETE
Traceback (most recent call last):
  File "finetune.py", line 271, in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)