I may have a similar empty-batch phenomenon, but I don't see a clear relation with the number of GPUs; it seems related to the Linux kernel version instead. Have you solved this since then?
Not really. The problem is only reproducible by loading that specific checkpoint, not others. I re-trained the model and the new checkpoint seems fine to load. I haven't encountered the problem recently, so I didn't look into it further.
And just a guess: I forced the model to save after 20k iterations, whereas checkpoints are normally saved after one full epoch. So maybe after 20k iterations there happened to be 3 batches left in that epoch's dataloader, which would leave one empty batch when sharded across 4 GPUs. But that doesn't explain why, with 1 GPU, the dataloader was restored in full instead of containing only 3 batches.
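To make the guess concrete: if the leftover batches are sharded round-robin across the ranks without padding, one rank ends up with nothing. A minimal plain-Python sketch (the round-robin, no-padding sharding is my assumption about the sampler, not something I verified):

```python
# Sketch of the guess: 3 leftover batches sharded round-robin across
# 4 ranks with no padding leaves one rank empty-handed.
leftover_batches = ["batch0", "batch1", "batch2"]  # what remains of the epoch
world_size = 4  # number of GPUs / ranks

for rank in range(world_size):
    shard = leftover_batches[rank::world_size]  # each rank takes every 4th batch
    print(f"rank {rank}: {shard if shard else '[] <- empty batch'}")
```

If the real sampler pads short shards (as `torch.utils.data.DistributedSampler` does when `drop_last=False`), this shouldn't happen, so a custom sampler or restored sampler state would be my suspect.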
I'm encountering this problem too. How do you handle the case where the batch is empty?
When I resume training from a saved checkpoint with 4 GPUs, in the main loop of training iterations (i.e., `for i, samples in enumerate(progress):`), I get an empty batch `samples = [{}]` at the very first fetch on 1 out of the 4 GPUs. But if I use 1 GPU with the same code, there is no empty batch.

An even weirder phenomenon: I also tested (again resuming from the checkpoint) a loop that just enumerates the dataloader and prints each batch. When using 1 GPU, the behavior was normal: it kept enumerating the dataloader and printing. But when using 4 GPUs, there were only 4 pieces of output, one of them being the empty batch `[{}]`, as if the dataset contained only 3 batches (which is not true; one epoch actually contains hundreds of batches).
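The test was essentially something like the sketch below (all names are stand-ins for the actual code, e.g. `probe` and its `dataloader` argument):

```python
# Hypothetical probe (names are stand-ins, not the actual training code):
# enumerate the resumed dataloader and print what each rank receives per step.
import torch.distributed as dist
from torch.utils.data import DataLoader

def probe(dataloader: DataLoader) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    for i, samples in enumerate(dataloader):
        empty = samples == [{}]  # the empty batch described above
        print(f"rank {rank}, step {i}: "
              + ("EMPTY batch [{}]" if empty else f"{len(samples)} item(s)"))
```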
Any clue about this issue? Thank you in advance.