HuangLK / transpeeder

train llama on a single A100 80G node using 🤗 transformers and 🚀 Deepspeed Pipeline Parallelism
Apache License 2.0

Training startup is extremely slow when full fine-tuning 7B-llama on 4x 3090 GPUs #19

Closed Ulov888 closed 1 year ago

Ulov888 commented 1 year ago

[screenshot] The wandb initialization phase in particular hangs for over ten minutes. Have you run into slow startup with multi-GPU training, and is there any possible fix? @Huanglk

Ulov888 commented 1 year ago

Solved: wandb was waiting for a choice to be entered at its interactive prompt.
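
For anyone hitting the same hang: a minimal sketch of how to configure wandb non-interactively so a multi-process launch never blocks on the account prompt. This assumes the stall is wandb's "(1) create account / (2) use existing / (3) don't visualize" choice, as described above; the project name is a placeholder, not from this repo.

```python
import os

# Resolve wandb's mode before any library triggers wandb.init(),
# so no interactive prompt can appear under a multi-GPU launcher.
# Valid values: "online", "offline" (log locally), "disabled" (no-op).
os.environ["WANDB_MODE"] = "disabled"

# Alternatively, supply an API key non-interactively
# (value is a placeholder for your own key):
# os.environ["WANDB_API_KEY"] = "<your-api-key>"

import wandb

# With the mode already set, init() returns immediately instead of
# waiting for a choice to be typed at the terminal.
run = wandb.init(project="llama-finetune")
```

The same effect can be had from the shell, e.g. `WANDB_MODE=disabled deepspeed train.py ...`, or by running `wandb login` once on the node before launching training.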