When training a model across multiple GPUs with parallel_mode="data_parallel", a "CUDA out of memory" error on one GPU triggers an exception in the onmt.utils.distributed.all_reduce_and_rescale_tensors function.
The issue arises because, when an "OOM" error occurs, that rank never computes its gradients, so the "tensors" parameter receives an empty list.
The solution is to pass a list of zero-filled tensors with the expected shapes instead. This ensures that the torch.distributed.all_reduce call can complete rather than blocking indefinitely on the other ranks. The drawback is that the accumulated gradients are still divided by the total number of GPUs, so the zero contribution scales the averaged gradient down.
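The effect of the workaround can be illustrated with a minimal single-process simulation of the sum-then-rescale collective. The function name and the two-rank setup below are illustrative, not part of the OpenNMT-py API; the point is that the OOM rank contributes zeros but still counts in the divisor, so every averaged gradient is scaled by (healthy ranks / world_size).

```python
import torch

def simulated_all_reduce_and_rescale(per_rank_grads, world_size):
    # Simulates torch.distributed.all_reduce with ReduceOp.SUM followed
    # by rescaling: gradients are summed element-wise across ranks, then
    # divided by the total number of GPUs (world_size).
    summed = [torch.zeros_like(g) for g in per_rank_grads[0]]
    for grads in per_rank_grads:
        for s, g in zip(summed, grads):
            s.add_(g)
    return [s / world_size for s in summed]

# Hypothetical two-GPU run: rank 1 hit OOM, so its gradients were never
# computed. The workaround substitutes zero tensors of matching shapes
# instead of an empty list, letting the collective complete.
shapes = [(2, 2), (3,)]
rank0_grads = [torch.ones(s) for s in shapes]    # healthy rank
rank1_grads = [torch.zeros(s) for s in shapes]   # OOM rank, zero-filled

avg = simulated_all_reduce_and_rescale([rank0_grads, rank1_grads],
                                       world_size=2)
# Each averaged gradient is halved: the zero contribution from the OOM
# rank still counts in the divisor.
```

With an empty list, the healthy ranks would enter all_reduce while the OOM rank never does, and the collective would hang; the zero tensors keep every rank participating.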