[Closed] JulioZhao97 closed this issue 1 year ago
Hi! I ran into the same problem. Did you just add `--ddp-backend=no_c10d`?
Same issue here as well: I hit the problem with `--ddp-backend=no_c10d`, and even switching the optimizer from `adam` to `nag` gives the same error.
I am facing the same issue, except one worker reports a grad_norm of 0:
grad_norm across the workers: rank 0 = 18.27746773 rank 1 = 0.00000000 rank 2 = 18.27746773 rank 3 = 18.27746773
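For context on why this log line is fatal: fairseq compares the gradient norm computed on each rank and aborts when they disagree, since a mismatch means the workers have desynchronized (e.g. one rank overflowed or saw NaN/inf gradients). Below is a minimal single-process sketch of that kind of check; the function name `check_grad_norms` and the tolerance are hypothetical, and a gathered list of floats stands in for the real cross-rank all-gather:

```python
import math

def check_grad_norms(grad_norms, tol=1e-6):
    """Sketch of a cross-worker grad-norm consistency check.

    In real DDP training each rank would all-gather its local grad norm;
    here a plain list simulates the gathered values (index = rank).
    """
    reference = grad_norms[0]
    for rank, norm in enumerate(grad_norms):
        # A mismatch usually means a desynchronized worker, e.g. one rank
        # skipped a batch or produced NaN/inf gradients.
        if not math.isclose(norm, reference, rel_tol=tol, abs_tol=tol):
            raise FloatingPointError(
                f"grad_norm mismatch: rank {rank} = {norm:.8f}, "
                f"rank 0 = {reference:.8f}"
            )

# The norms reported in this thread: rank 1 is 0, so the check fails.
norms = [18.27746773, 0.00000000, 18.27746773, 18.27746773]
try:
    check_grad_norms(norms)
except FloatingPointError as e:
    print("inconsistent:", e)
```

So the error is a symptom, not the cause: the real question is why one rank's gradients collapsed to zero, which is typically a data-sharding or precision problem on that worker rather than anything wrong with the check itself.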
When I try to pretrain the base model, I enable multi-GPU training in pretrain_base.sh like this:
but the following error occurs:
Could someone please tell me how to fix this, or how to enable multi-GPU training correctly? Thanks!