FlagAI-Open / FlagAI

FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensible toolkit for large-scale models.
Apache License 2.0

Pre-training script in the aquila-7b example fails with an error #395

Closed weicheng59 closed 1 year ago

weicheng59 commented 1 year ago

System Info

When I run pre-training, bmtrain reports the error shown below. The command was: bash dist_trigger_docker.sh hostfile Aquila-pretrain.yaml aquila-7b test0

![9201686728893 pic](https://github.com/FlagAI-Open/FlagAI/assets/8345745/68d38f2e-ff4f-46ef-a3f6-115e4848ca5a)

However, when I try the same approach locally, it runs normally.

[screenshot: 9221686729012_ pic]

Local environment: CUDA 11.7, torch 1.13.1, FlagAI 1.7.1, bmtrain 0.2.2

Information

Tasks

Reproduction

1. cd examples/Aquila, then run bash dist_trigger_docker.sh hostfile Aquila-pretrain.yaml aquila-7b test0

2. Check the log file; it contains the errors shown in the screenshot above.

Expected behavior

Pre-training should start.

ftgreat commented 1 year ago

Could you paste your hostfile?

weicheng59 commented 1 year ago

[screenshot] This is a single machine. The IP was obtained with the following command: export NODE_ADDR=$(ifconfig -a|grep inet|grep -v 127.0.0.1|grep -v inet6|awk '{print $2;}'|sed -n '1P')
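For readers unfamiliar with the one-liner above, here is a minimal sketch that applies the same filters stage by stage. It assumes the modern net-tools `ifconfig` output format (`inet <address> netmask ...`). The function name `first_ipv4` is hypothetical and not part of FlagAI; it reads from stdin so it can be tried on saved output.

```shell
# Hypothetical helper: the same filters as the NODE_ADDR command,
# applied to ifconfig-style text on stdin.
first_ipv4() {
    grep inet |            # keep lines containing "inet" (IPv4 and IPv6 alike)
    grep -v 127.0.0.1 |    # drop the loopback address
    grep -v inet6 |        # drop IPv6 lines
    awk '{print $2;}' |    # second whitespace-separated field, i.e. the address
    sed -n '1P'            # keep only the first remaining address
}
```

Usage would be `ifconfig -a | first_ipv4`. Because only the first match is kept, the result depends on the order in which interfaces are listed, so on a host with several non-loopback interfaces it is worth confirming that the printed address is the one the hostfile should contain.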

ftgreat commented 1 year ago

> [screenshot] This is a single machine. The IP was obtained with the following command: export NODE_ADDR=$(ifconfig -a|grep inet|grep -v 127.0.0.1|grep -v inet6|awk '{print $2;}'|sed -n '1P')

I think someone reported something like this in the group chat before. Does the hostfile contain a blank line?
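One quick way to check the blank-line question is sketched below. The helper names `count_blank_lines` and `strip_blank_lines` are hypothetical, and the thread does not confirm that a blank line was the actual cause, so treat this as a diagnostic aid only.

```shell
# Hypothetical helpers; pass the path of the hostfile given to the launcher.

# Print how many lines are empty or contain only spaces/tabs.
count_blank_lines() {
    awk 'NF == 0 { n++ } END { print n + 0 }' "$1"
}

# Print the file with empty or whitespace-only lines removed.
strip_blank_lines() {
    awk 'NF > 0' "$1"
}
```

For example, `count_blank_lines hostfile` prints the number of blank lines, and `strip_blank_lines hostfile > hostfile.clean` writes a cleaned copy that can be passed to the launch script instead. Writing to a new file keeps the original intact for comparison.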

ftgreat commented 1 year ago

Closing this issue for now. Please reopen it if the problem persists. Thanks.