oho-work closed this issue 8 months ago
This happens while fine-tuning the model: progress is extremely slow.
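It looks like communication is hanging. I suggest checking your NCCL environment variable settings and the related hardware configuration; please ask your system administrator or ops team. If you are handling it yourself, you can work through the NCCL troubleshooting guide: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html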
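As a starting point for that check, here is a minimal sketch that lists the NCCL variables visible to the training process and turns on NCCL's own logging. The helper names (`collect_nccl_env`, `enable_nccl_debug`) and the short list of variables are assumptions for illustration; the thread does not say which variable was actually misconfigured.

```python
import os

# Assumption: a few commonly inspected NCCL variables. The thread does not
# identify which setting caused the hang, so this list is only illustrative.
NCCL_VARS_OF_INTEREST = (
    "NCCL_DEBUG",
    "NCCL_SOCKET_IFNAME",
    "NCCL_IB_DISABLE",
    "NCCL_P2P_DISABLE",
)


def collect_nccl_env(environ=None):
    """Return all NCCL_* variables that are set; listed ones that are unset map to None."""
    environ = os.environ if environ is None else environ
    found = {k: v for k, v in environ.items() if k.startswith("NCCL_")}
    for name in NCCL_VARS_OF_INTEREST:
        found.setdefault(name, None)
    return found


def enable_nccl_debug(environ=None):
    """Set NCCL_DEBUG=INFO unless the user has already chosen a level."""
    environ = os.environ if environ is None else environ
    environ.setdefault("NCCL_DEBUG", "INFO")


if __name__ == "__main__":
    enable_nccl_debug()
    for name, value in sorted(collect_nccl_env().items()):
        print(f"{name}={value if value is not None else '<unset>'}")
```

NCCL reads these variables when it initializes, so `enable_nccl_debug()` would need to run before the training script creates its process group. Running the script on every node and comparing the output is one way to spot settings that differ between machines.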
Thank you very much for your reply. The NCCL environment variable settings were indeed the problem.
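Anyone who runs into a similar problem can take a look at this Zhihu article: "多卡运行分布式训练卡死" (roughly, "Distributed training hangs when running on multiple GPUs").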
Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
Expected Behavior
No response
Steps To Reproduce
No response
Environment
No response
Anything else?
No response