clue-ai / ChatYuan

ChatYuan: Large Language Model for Dialogue in Chinese and English
https://www.clueai.cn

Question about distributed training #47

Closed Tian14267 closed 1 year ago

Tian14267 commented 1 year ago

Hello. While trying the distributed-training code for distributed multi-GPU training, I got this error:

Traceback (most recent call last):
  File "train.py", line 20, in <module>
    hvd.init()
AttributeError: module 'horovod.torch' has no attribute 'init'

What is going on here?

My environment: horovod == 0.23.0, torch == 2.0.0
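
The thread does not say how the `init` error was eventually fixed. As a first diagnostic, it may help to check where Python is actually loading `horovod.torch` from, since a local file or folder named `horovod` or an incomplete install are plausible (but unconfirmed) causes; Horovod 0.23.0 also predates torch 2.0.0, so a version mismatch is another possibility. The sketch below is only an assumption about how one might investigate; `describe_module` and `check_horovod_torch` are hypothetical helper names, not part of Horovod.

```python
import importlib
import importlib.util


def describe_module(name):
    """Return the path `name` would be imported from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. `horovod`) is missing or broken.
        return None
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no single origin file; report their search paths.
    return str(list(spec.submodule_search_locations or []))


def check_horovod_torch():
    """Return a one-line diagnosis of the local horovod.torch install (hypothetical helper)."""
    location = describe_module("horovod.torch")
    if location is None:
        return "horovod.torch was not found; is Horovod installed in this environment?"
    try:
        hvd = importlib.import_module("horovod.torch")
    except Exception as exc:  # a failed native extension load surfaces here
        return f"horovod.torch found at {location} but failed to import: {exc!r}"
    if not hasattr(hvd, "init"):
        # Same symptom as in the traceback above; the path shows what was actually loaded.
        return f"horovod.torch loaded from {location} has no 'init'; check this path"
    return f"horovod.torch looks usable (loaded from {location})"


if __name__ == "__main__":
    print(check_horovod_torch())
```

If the reported path points into the project directory rather than `site-packages`, a local name is shadowing the installed package.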

Tian14267 commented 1 year ago

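The init problem is solved, but I ran into a new one. I configured training to use two GPUs, but only one GPU is actually training. What could be causing this?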
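
The cause of the single-GPU behaviour is not stated in this thread. One common situation with Horovod is that the script is started with plain `python train.py`, which creates a single worker, instead of through a launcher such as `horovodrun -np 2 python train.py`; another is that every worker ends up on the same device because it was never pinned to its local rank. A minimal sketch for checking both, assuming Horovod and PyTorch are installed (`pick_device` is a hypothetical helper, not a Horovod function):

```python
def pick_device(local_rank, visible_gpus):
    """Map a worker's local rank to a GPU index; None when no GPU is visible."""
    if visible_gpus <= 0:
        return None
    return local_rank % visible_gpus


def main():
    try:
        import torch
        import horovod.torch as hvd
    except ImportError as exc:
        print(f"torch/horovod not available: {exc}")
        return

    hvd.init()
    device = pick_device(hvd.local_rank(), torch.cuda.device_count())
    if device is not None:
        # Pin this worker to its own GPU so workers do not share device 0.
        torch.cuda.set_device(device)
    # With two workers, this line should appear twice, with different ranks.
    print(f"rank {hvd.rank()} of {hvd.size()}, local_rank {hvd.local_rank()}, gpu {device}")


if __name__ == "__main__":
    main()
```

If the output shows a single line with `rank 0 of 1`, only one worker was started and the launch command is the place to look; if two ranks report the same GPU index, the device pinning is.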

joytianya commented 1 year ago

@vaas1993 could you take a look?

Tian14267 commented 1 year ago

@vaas1993 could you take a look?

Oh, it seems to work now. The problem is solved. Thanks!