Open salvacarrion opened 3 years ago
I solved it by using a different optimizer (`nag` instead of `adam`). At this point, I don't know if this is a bug or just a weird way of saying that the hyperparameters I chose for my model are not correct.
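The optimizer swap above can be sketched in plain Python: Nesterov-accelerated gradient descent (the update rule behind fairseq's `nag` optimizer) minimizing a toy quadratic. The function, learning rate, and momentum below are illustrative placeholders, not values from the issue.

```python
# Minimal sketch of Nesterov-accelerated gradient (NAG), the update
# rule that `--optimizer nag` selects, on a toy quadratic f(w) = (w - 3)^2.
def grad(w):
    return 2.0 * (w - 3.0)

def nag_minimize(w=0.0, lr=0.1, momentum=0.9, steps=500):
    v = 0.0
    for _ in range(steps):
        # Nesterov trick: evaluate the gradient at the look-ahead point.
        g = grad(w + momentum * v)
        v = momentum * v - lr * g
        w += v
    return w

w_star = nag_minimize()
print(w_star)  # should end up near the minimum at w = 3
```

Adam would replace the momentum update with per-parameter adaptive step sizes; on some models that difference alone is enough to push gradients apart across workers.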
I have the same problem, and the `grad_norm` on one of the workers is 0. I don't know why yet. Thanks for your solution.
Same problem here. But my `grad_norm` values are moderate (not `nan`, `inf`, nor `0`) and quite close.
`grad_norm` across the workers:

```
rank 0 = 11.16489500
rank 1 = 10.57914402
```

I have confirmed that my 2 GPUs are the same.
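This failure mode can be illustrated with a small sketch: fairseq's trainer gathers each rank's gradient norm and aborts when they disagree beyond a tolerance, which is roughly the check below. The tolerance value and function name are hypothetical, not fairseq's exact implementation.

```python
# Sketch of the cross-worker consistency check that aborts c10d training:
# gather each rank's grad norm and require them to agree within a tolerance.
import math

def grad_norms_consistent(norms, rel_tol=1e-4):
    """Return True if all per-rank grad norms match within rel_tol."""
    ref = norms[0]
    return all(math.isclose(n, ref, rel_tol=rel_tol) for n in norms)

# The two norms reported above: close, but far outside any sane tolerance.
per_rank = [11.16489500, 10.57914402]
print(grad_norms_consistent(per_rank))  # False: the ranks disagree
```

So "moderate and quite close" is not enough; the check expects the norms to be essentially identical, since each worker should see the same synchronized gradients.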
I can work around it by switching `--ddp-backend` to `no_c10d`, but I would still like to figure out why I cannot use `c10d` for acceleration.
Other solutions (which did not work for me):
Could you take a look at this? Any suggestions would be highly appreciated. Thanks for taking your time, @dianaml0!
🐛 Bug
I can train Transformer models but not fully convolutional or LSTM models (e.g. `fconv`, `fconv_iwslt_de_en`, `fconv_wmt_en_de`, `lstm`, `lstm_luong_wmt_en_de`, ...) because gradients are inconsistent between workers. Following this thread, I have tried a range of different args such as `--ddp-backend=no_c10d`, `--ddp-backend=legacy_ddp`, `--use-bmuf`, ... Furthermore, I've also limited `--max-tokens` (256, 512, 1024, ...) and the batch size (16, 32, 64) to rule out memory problems.

To Reproduce
I'm using the europarl v7 es-en dataset tokenized with Moses and fastBPE, but this error appears regardless of the dataset.
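For reference, a sketch of the kind of command that triggers this; the data path is a placeholder, and the hyperparameter values are illustrative rather than the exact ones used here:

```shell
# Train an fconv model on binarized data (path is a placeholder).
fairseq-train data-bin/europarl.es-en \
    --arch fconv_iwslt_de_en \
    --optimizer nag --lr 0.25 --clip-norm 0.1 \
    --dropout 0.2 --max-tokens 512 \
    --ddp-backend=c10d   # fails; no_c10d works
```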
Error:
Environment
How you installed fairseq (`pip`, source): `pip install --editable ./`