Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model - a low-resource Chinese LLaMA + LoRA approach, with the structure modeled on Alpaca
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

Single-machine multi-GPU training keeps hitting timeout errors; any ideas on how to fix this? #53

Closed. xqmmy closed this issue 1 year ago

xqmmy commented 1 year ago

[E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
localhost:10108:60893 [4] NCCL INFO [Service thread] Connection closed by localRank 4
localhost:10108:60278 [0] NCCL INFO comm 0x9ea4350 rank 4 nranks 7 cudaDev 4 busId a1000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
Traceback (most recent call last):
  File "finetune.py", line 271, in
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/root/miniconda3/envs/vicuna/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train

[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801815 milliseconds before timing out.
localhost:10101:60894 [3] NCCL INFO [Service thread] Connection closed by localRank 3
localhost:10101:59694 [0] NCCL INFO comm 0xa8d5900 rank 3 nranks 7 cudaDev 3 busId 81000 - Abort COMPLETE
Traceback (most recent call last):
  File "finetune.py", line 271, in
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)

xqmmy commented 1 year ago

I have tried many approaches and none of them worked.

Facico commented 1 year ago

Do your CUDA and PyTorch versions actually match? You can check on the PyTorch website.

xqmmy commented 1 year ago

> Do your CUDA and PyTorch versions actually match? You can check on the PyTorch website.

Yes, they match.

Facico commented 1 year ago

Please follow the issue guidelines when asking questions. As above, you should tell me your PyTorch and CUDA versions; from an error message alone it is hard for me to guess your exact situation.

xqmmy commented 1 year ago

> Please follow the issue guidelines when asking questions. As above, you should tell me your PyTorch and CUDA versions; from an error message alone it is hard for me to guess your exact situation.

OK, sorry for not describing it clearly: torch 1.13.1, CUDA 11.7, using the full merged data and the original multi-GPU finetune script, running on 7 A4000 GPUs.

Facico commented 1 year ago

For your configuration, the matching PyTorch install command should be "pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117". Check whether you accidentally installed the CPU-only build.
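A quick way to confirm which build actually got installed (a minimal sketch, assuming the vicuna conda environment from the traceback is active):

```bash
# Print the installed torch version, the CUDA version it was built against,
# and whether it can see the GPUs. A CPU-only wheel reports a version ending
# in "+cpu" and torch.version.cuda is None.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"
# Expected for this setup: 1.13.1+cu117 11.7 True 7
```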

Does it run successfully when you use a single GPU? For multi-GPU, try 2-GPU and 4-GPU configurations and see whether the problem still appears; there is a similar issue here.
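For example, something along these lines could narrow it down (a sketch only; the launcher and arguments used by the repo's actual finetune script may differ):

```bash
# Run on 2 of the 7 GPUs first, then widen to 4, to find where the ALLGATHER
# hang starts. torchrun here is an assumption; substitute the launcher and
# flags from the repo's multi-GPU finetune script.
export CUDA_VISIBLE_DEVICES=0,1
torchrun --nproc_per_node=2 finetune.py  # plus the usual training arguments
```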

You can add export NCCL_DEBUG=INFO when running to see whether you get more detailed error output. Or you can check whether this helps.
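For example (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables):

```bash
# Turn on verbose NCCL logging for the next run; the INIT/NET subsystems
# usually show which transport (P2P, SHM, NET) and which interface NCCL
# picked, which is where these hangs tend to originate.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# then launch the finetune script as usual
```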

zhoujx4 commented 1 year ago

I had the same problem as you: timeouts with multi-GPU on a single machine. Changing the BIOS settings to disable ACS fixed it for me; you could give that a try.

zhoujx4 commented 1 year ago

For details you can refer to this: https://www.modb.pro/db/617940
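As a rough way to check from Linux whether ACS is currently enabled on the PCIe bridges (a sketch; the lasting fix is the BIOS setting described above):

```bash
# Requires root. If the ACSCtl lines show SrcValid+ on the switches between
# the GPUs, ACS is on and GPU peer-to-peer traffic is forced through the root
# complex, a documented cause of NCCL hangs.
sudo lspci -vvv | grep -i acsctl
# setpci can clear the ACS bits at runtime, but that change does not survive
# a reboot, so disabling ACS in the BIOS is the permanent solution.
```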

xqmmy commented 1 year ago

After switching to 3090s the problem went away; I have no idea why.

xqmmy commented 1 year ago

> I had the same problem as you: timeouts with multi-GPU on a single machine. Changing the BIOS settings to disable ACS fixed it for me; you could give that a try.

Thanks for the answer, I'll give it a try.

thelongestusernameofall commented 11 months ago

The solution in https://github.com/NVIDIA/nccl/issues/426 works.

export NCCL_IB_GID_INDEX=3 solved my problem.
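For context (a hedged note: the exact GID table layout depends on the NIC), index 3 commonly maps to the RoCE v2 GID, which is why this export helps on some setups:

```bash
# Export before launching the training script so every NCCL rank inherits it.
export NCCL_IB_GID_INDEX=3
# Combine with NCCL_DEBUG=INFO to confirm in the log which GID NCCL selects.
```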