Open sailor-z opened 2 months ago
Hi! Please try using bfloat16, or a smaller learning rate.
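For anyone wondering why bfloat16 helps here: float16 has a very small dynamic range, so large loss or activation values overflow to inf and then turn into NaN. Below is a minimal sketch using NumPy's `float16` (NumPy has no bfloat16 dtype, so `float32` stands in for bf16's wider exponent range):

```python
import numpy as np

# fp16's largest finite value is ~65504, so a loss or activation that
# exceeds it overflows to inf, and the next inf - inf (or 0 * inf)
# produces NaN. bfloat16 keeps float32's 8-bit exponent, so the same
# magnitude stays finite.
x = np.float16(70000.0)                  # exceeds fp16 range -> inf
print(np.isinf(x))                       # True
print(np.isnan(x - x))                   # inf - inf -> NaN
print(np.isfinite(np.float32(70000.0)))  # True: fine in fp32/bf16 range
```

In PyTorch Lightning 2.x this typically corresponds to passing `precision="bf16-mixed"` to the `Trainer` (older 1.x releases used `precision="bf16"`).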
I ran into the same problem: whenever I set up a multi-GPU run, it consistently goes wrong. Adjusting the learning rate or using bfloat16 doesn't work for me.
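On the learning-rate side, it may still be worth adding gradient clipping rather than only lowering the rate. Here is a minimal plain-Python sketch on a hypothetical toy objective f(x) = x², showing how too large a step size blows up into NaN while clipping bounds the update:

```python
import math

# Gradient descent on the toy objective f(x) = x^2 (gradient 2x).
# With lr > 1 the iterate is multiplied by (1 - 2*lr), so its magnitude
# grows every step, overflows to inf, and inf - inf yields NaN.
def sgd(lr, steps, clip=None):
    x = 1.0
    for _ in range(steps):
        g = 2.0 * x
        if clip is not None:
            g = max(-clip, min(clip, g))  # gradient clipping bounds the update
        x -= lr * g
    return x

print(math.isnan(sgd(lr=1.5, steps=2000)))               # True: diverged to NaN
print(abs(sgd(lr=0.1, steps=100)) < 1e-6)                # True: converged
print(math.isfinite(sgd(lr=1.5, steps=2000, clip=1.0)))  # True: stays finite
```

In Lightning, the analogous knob would be the `Trainer`'s `gradient_clip_val` argument.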
Hi,
Thanks for releasing the code! I am retraining the model with PyTorch Lightning. It works perfectly on a single A100 GPU, but I always get a NaN loss when training on multiple GPUs. I am using the DDP strategy, which works fine for other methods. What could be the reason?