facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Got empty batch when using multiple gpus resuming from a checkpoint #3999

Open 18445864529 opened 2 years ago

18445864529 commented 2 years ago

When I resume training from a saved checkpoint with 4 GPUs, in the main loop of training iterations (i.e., for i, samples in enumerate(progress):), I get an empty batch samples=[{}] at the very start of fetching on 1 out of the 4 GPUs. But if I run the same code with 1 GPU, there is no empty batch.
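A minimal diagnostic sketch (not fairseq code; it assumes progress is the iterator from the loop above and that torch.distributed is initialized with one process per GPU) to check which rank receives the empty batch and at which step:

import torch.distributed as dist

for i, samples in enumerate(progress):
    # samples is a list of sample dicts; [{}] means the only entry is empty
    if any(len(s) == 0 for s in samples):
        rank = dist.get_rank() if dist.is_initialized() else 0
        print(f"rank {rank}: empty batch at step {i}: {samples}")
    # the normal train step would follow here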

An even weirder phenomenon: I tested the following (also resuming from the checkpoint):

for samples in progress: 
    print(samples)

When using 1 GPU, the behavior was normal: it kept enumerating the dataloader and printing. But when using 4 GPUs, there were only 4 outputs, one of them being the empty batch [{}], as if there were only 3 batches in the dataset (which is not true, since one epoch actually contains hundreds of batches).

Any clue about this issue? Thank you in advance.

dearchill commented 2 years ago

I may be seeing a similar empty-batch phenomenon, but I don't see a clear relation to the number of GPUs; it seems related to the Linux kernel version instead. Have you solved this in the meantime?

18445864529 commented 2 years ago

Not really. The problem is only reproducible by loading that specific checkpoint, not others. I re-trained the model and the new checkpoint seems fine to load. I haven't encountered the problem recently, so I haven't looked into it further.

Just a guess: I forced the model to save after 20k iterations, whereas checkpoints are normally saved after one entire epoch, so maybe after 20k iterations there happened to be only 3 batches left in the dataloader for that epoch, resulting in one empty batch when resuming on 4 GPUs. But this does not explain why, when using 1 GPU, the dataloader was restored in full instead of having only 3 batches.
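To make the guess concrete, here is a toy illustration (not fairseq's actual iterator code, just the round-robin arithmetic) of how 3 leftover batches dealt across 4 data-parallel ranks would leave one rank with only a padding batch:

import itertools

remaining_batches = ["batch_0", "batch_1", "batch_2"]  # 3 real batches left in the epoch
num_shards = 4                                         # 4 GPUs / data-parallel ranks

for rank in range(num_shards):
    shard = list(itertools.islice(remaining_batches, rank, None, num_shards))
    if not shard:
        shard = [{}]  # pad so every rank takes the same number of steps
    print(f"rank {rank}: {shard}")

# rank 0: ['batch_0']
# rank 1: ['batch_1']
# rank 2: ['batch_2']
# rank 3: [{}]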

longkhanh-fam commented 1 month ago

I'm encountering this problem as well. How do you handle it when the batch is empty?
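For reference, a rough workaround sketch (assuming progress and trainer are the usual objects from fairseq's train loop; this is not an official fairseq fix) that drops empty sample dicts and skips an all-empty batch instead of crashing. Note that skipping a step on only one rank can stall gradient all-reduce in distributed training, so logging and investigating the root cause is safer than relying on this:

for i, samples in enumerate(progress):
    samples = [s for s in samples if len(s) > 0]  # drop empty dicts like [{}]
    if len(samples) == 0:
        print(f"skipping empty batch at step {i}")
        continue
    trainer.train_step(samples)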