THUDM / SwissArmyTransformer

SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.
https://THUDM.github.io/SwissArmyTransformer
Apache License 2.0
978 stars, 92 forks

DeepSpeed distributed training: loss is NaN or inf #161

Open JohnTang93 opened 9 months ago

JohnTang93 commented 9 months ago

Single-node multi-GPU training works fine, but multi-node multi-GPU training fails with this error:

Skipping backward and optimizer step for nan or inf in forwarding metrics/loss!

1049451037 commented 9 months ago

Try replacing --fp16 with --bf16.
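
For context on why this swap can help: fp16 has a 5-bit exponent and a largest finite value of 65504, while bf16 keeps the 8-bit exponent of fp32, so values that overflow in fp16 stay finite in bf16 (at reduced precision). Whether overflow is what actually happens in the multi-node run above is an assumption; the thread does not confirm the cause. The sketch below uses only the Python standard library, and the helper names `fits_fp16` and `to_bf16` are hypothetical, not part of SwissArmyTransformer or DeepSpeed.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_fp16(x: float) -> bool:
    """Return True if x can be packed as a finite fp16 value."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping the top 16 bits of the fp32 encoding.

    This truncates instead of rounding to nearest, which is enough
    to illustrate the dynamic range.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    value = 70000.0  # e.g. a large activation or logit, just above FP16_MAX
    print("fits in fp16:", fits_fp16(value))
    print("as bf16:", to_bf16(value), "finite:", math.isfinite(to_bf16(value)))
```

The trade-off is precision: bf16 keeps only 7 explicit mantissa bits, so 70000.0 is stored as a nearby coarser value instead of overflowing. bf16 also requires hardware that supports it.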