alexis-allemann opened 5 months ago (status: Open)
Thanks! I get your point, but the only risk is running into many OOMs without realizing it. What were the circumstances under which you ran into all of this?
I encountered this error during a standard experiment while training a translation model. I had set a batch size that almost completely filled the memory of my GPUs. After a few steps (about a thousand), one batch triggered an OOM error on one of my GPUs, aborting the training run. It's not very practical for an infrequent OOM error to interrupt the whole training process.
I've updated my pull request to recompute the gradients with a new batch, which seems like a better approach than just filling the tensors with zeros. Perhaps we could also consider adding an option to specify the number of allowed attempts before terminating training, for example opt.max_oom_batch_retries. What do you think about this?
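The retry idea above can be sketched in a few lines. This is only an illustration, not the actual pull request code: `OOMError`, `train_step`, and `step_with_retries` are hypothetical names, and the exception stands in for `torch.cuda.OutOfMemoryError` so the sketch stays framework-free.

```python
class OOMError(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this sketch."""


def train_step(batch):
    # Hypothetical step: raises OOMError when the batch doesn't fit in memory.
    if batch["too_big"]:
        raise OOMError("CUDA out of memory")
    return {"loss": 0.5}


def step_with_retries(batches, max_oom_batch_retries=3):
    """Try successive batches, skipping ones that OOM, up to a retry limit."""
    for attempt, batch in enumerate(batches):
        if attempt > max_oom_batch_retries:
            break
        try:
            return train_step(batch)
        except OOMError:
            # In a real trainer you would also zero the gradients and empty
            # the CUDA cache before recomputing with the next batch.
            continue
    raise RuntimeError("exceeded max_oom_batch_retries; aborting training")
```

With `max_oom_batch_retries=3`, up to three OOM-ing batches are skipped before the run is aborted, which keeps the original fail-fast behavior as a backstop.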
You can try this approach, but then run it with a batch size you know is too big, so that OOM is triggered deliberately. I don't think it is bulletproof and it will raise exceptions; just saying, but I'm interested to see. NB: are you using sentence-based or token-based batch sizes?
You're right, I've just tried it and it doesn't seem like such a good idea: torch.distributed.all_gather now times out because the processes deadlock against each other... To answer your other question, I use token-based batch sizes.
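The deadlock described above can be illustrated without torch: collective operations like all_gather require every rank to participate, so a rank that silently skips the call (e.g. while retrying after an OOM) leaves the other ranks blocked until they time out. The sketch below uses a `threading.Barrier` as a stand-in for the collective; all names here are hypothetical.

```python
import threading


def worker(rank, barrier, results, skip_collective=False):
    # Every rank must reach the collective. A rank that skips it (as an
    # OOM-retrying rank would) leaves the others stuck waiting.
    if skip_collective:
        results[rank] = "skipped"
        return
    try:
        # Stand-in for torch.distributed.all_gather with a timeout.
        barrier.wait(timeout=0.2)
        results[rank] = "ok"
    except threading.BrokenBarrierError:
        results[rank] = "timeout"


def run(skip_rank=None, world_size=2):
    barrier = threading.Barrier(world_size)
    results = {}
    threads = [
        threading.Thread(target=worker, args=(r, barrier, results, r == skip_rank))
        for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

When no rank skips, both reach the barrier and return "ok"; when rank 0 skips, rank 1 times out, which mirrors the all_gather timeout seen in the experiment. This is why a per-rank retry needs coordination (e.g. all ranks agreeing to redo the step) rather than a local try/except.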
Issue #2549