When training a model across multiple GPUs with parallel_mode="data_parallel", a "CUDA out of memory" error on one GPU triggers an exception in the onmt.utils.distributed.all_reduce_and_rescale_tensors function.
The issue arises because, when an "OOM" error occurs, that rank never computes its gradients, so the "tensors" parameter receives an empty list.
The solution is to pass a list of zero-filled tensors with the expected shapes instead. This ensures that the torch.distributed.all_reduce call can complete rather than blocking indefinitely on the other ranks. The drawback is that the accumulated gradients are still divided by the total number of GPUs, so the zero contribution scales the averaged gradient down.
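The effect of the workaround can be illustrated with a minimal single-process simulation of the sum-then-rescale collective. The function name and the two-rank setup below are illustrative, not part of the OpenNMT-py API; the point is that the OOM rank contributes zeros but still counts in the divisor, so every averaged gradient is scaled by (healthy ranks / world_size).

```python
import torch

def simulated_all_reduce_and_rescale(per_rank_grads, world_size):
    # Simulates torch.distributed.all_reduce with ReduceOp.SUM followed
    # by rescaling: gradients are summed element-wise across ranks, then
    # divided by the total number of GPUs (world_size).
    summed = [torch.zeros_like(g) for g in per_rank_grads[0]]
    for grads in per_rank_grads:
        for s, g in zip(summed, grads):
            s.add_(g)
    return [s / world_size for s in summed]

# Hypothetical two-GPU run: rank 1 hit OOM, so its gradients were never
# computed. The workaround substitutes zero tensors of matching shapes
# instead of an empty list, letting the collective complete.
shapes = [(2, 2), (3,)]
rank0_grads = [torch.ones(s) for s in shapes]    # healthy rank
rank1_grads = [torch.zeros(s) for s in shapes]   # OOM rank, zero-filled

avg = simulated_all_reduce_and_rescale([rank0_grads, rank1_grads],
                                       world_size=2)
# Each averaged gradient is halved: the zero contribution from the OOM
# rank still counts in the divisor.
```

With an empty list, the healthy ranks would enter all_reduce while the OOM rank never does, and the collective would hang; the zero tensors keep every rank participating.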