QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0
13.59k stars 1.11k forks

[BUG] GPU power draw is very low but GPU utilization is at 100%. What is going on? #970

Closed oho-work closed 8 months ago

oho-work commented 8 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

gpu

Expected Behavior

No response

Steps To Reproduce

No response

Environment

No response

Anything else?

No response

oho-work commented 8 months ago

This happens while fine-tuning the model, and progress is extremely slow.
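One way to spot the symptom reported here (full utilization, low power draw) is to compare the two readings from `nvidia-smi` per GPU. The sketch below is only an illustration: the helper names and the thresholds (`util_min`, `power_frac_max`) are hypothetical, and it assumes the driver reports numeric values for `power.draw` and `power.limit` (some GPUs report `[N/A]`, which this parser does not handle).

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu interface.
QUERY = "index,utilization.gpu,power.draw,power.limit"


def parse_gpu_rows(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        idx, util, draw, limit = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(idx),
            "util": float(util),    # percent
            "draw": float(draw),    # watts
            "limit": float(limit),  # watts
        })
    return rows


def looks_stalled(row, util_min=95.0, power_frac_max=0.4):
    """Heuristic only: near-full utilization while drawing a small share of
    the power limit. The thresholds are arbitrary assumptions."""
    return row["util"] >= util_min and row["draw"] <= power_frac_max * row["limit"]


def query_gpus():
    """Run nvidia-smi and return the parsed rows (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_rows(out)
```

Calling `[r["index"] for r in query_gpus() if looks_stalled(r)]` on the training node lists the GPUs that match the pattern. A match is a hint rather than a diagnosis.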

jklj077 commented 8 months ago

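It looks like communication is stuck. I suggest checking the NCCL environment variable settings and the related hardware configuration; please consult your system administrator or ops team. If you are handling this yourself, you can work through the NCCL troubleshooting guide: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html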
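As a starting point for that check, the sketch below lists the `NCCL_*` variables visible to the training process and builds an environment with `NCCL_DEBUG=INFO`, which makes NCCL log its initialization and transport choices. The helper names are hypothetical, and the thread does not say which variable was misconfigured, so this only surfaces the settings for inspection.

```python
import os


def nccl_env(environ=None):
    """Return every NCCL_* variable from the mapping (default: os.environ)."""
    env = os.environ if environ is None else environ
    return {k: v for k, v in sorted(env.items()) if k.startswith("NCCL_")}


def with_nccl_debug(environ=None):
    """Copy the environment and enable NCCL_DEBUG=INFO unless already set."""
    env = dict(os.environ if environ is None else environ)
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


if __name__ == "__main__":
    # Print the current NCCL settings so they can be reviewed one by one.
    for name, value in nccl_env().items():
        print(f"{name}={value}")
```

The dictionary returned by `with_nccl_debug()` can be passed as `env=` to `subprocess.run` when launching the fine-tuning script, so the debug logging applies to that run only.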

oho-work commented 8 months ago

Thank you very much for your reply. The problem was indeed in the NCCL environment variable settings.

oho-work commented 8 months ago

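For anyone who runs into a similar problem, take a look at this Zhihu article: 多卡运行分布式训练卡死 ("Distributed training hangs when running on multiple GPUs").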