OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

List index out of range in onmt.utils.distributed.all_reduce_and_rescale_tensors:51 #2549

alexis-allemann commented 5 months ago

When training a model across multiple GPUs with parallel_mode="data_parallel", a "CUDA out of memory" error on one rank raises an IndexError ("list index out of range") in onmt.utils.distributed.all_reduce_and_rescale_tensors, at this line:

    buffer_t = (
        tensors[0].new(math.ceil(buffer_size / tensors[0].element_size())).zero_()
    )

The issue arises because, after an OOM, the backward pass never completes, so no gradients are produced and the tensors argument arrives as an empty list; evaluating tensors[0] then fails immediately.
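
For illustration, here is a standalone repro of that failure mode (a minimal sketch, not the actual OpenNMT-py call path; the buffer_size default here is arbitrary):

    import math

    def repro(tensors, buffer_size=10 * 1024 * 1024):
        # Mirrors the failing line: tensors[0] is evaluated to size the
        # communication buffer, so an empty list raises IndexError here,
        # before torch.distributed is ever involved.
        return (
            tensors[0].new(math.ceil(buffer_size / tensors[0].element_size())).zero_()
        )

    repro([])  # IndexError: list index out of range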

One workaround is for the failing rank to pass a list of zero-filled tensors instead. This lets the torch.distributed.all_reduce call complete on every rank rather than blocking indefinitely while waiting for the missing participant. The drawback is that the accumulated gradients are still divided by the total number of GPUs, so the zero contribution from the OOM rank dilutes the averaged update.
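
A minimal sketch of that workaround, assuming the trainer checks for missing gradients before entering the collective; zero_grads_fallback is a hypothetical helper, not actual OpenNMT-py code:

    import torch

    def zero_grads_fallback(model):
        # Gather the gradients this rank wants to all-reduce.
        grads = [
            p.grad.data for p in model.parameters()
            if p.requires_grad and p.grad is not None
        ]
        if not grads:
            # OOM path: contribute zeros of the right shapes so that
            # torch.distributed.all_reduce on the other ranks can
            # complete instead of blocking on a missing participant.
            grads = [
                torch.zeros_like(p.data)
                for p in model.parameters() if p.requires_grad
            ]
        return grads

Because the reduced sum is still rescaled by the world size afterwards, the zeros contributed by the failing rank pull the averaged gradients toward zero, which is the division drawback noted above.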