Quick note: The graphs can be somewhat fixed by changing the Horizontal Axis from "Step" to "Relative" or "Wall", since that orders the points by time instead of overlapping them.
@Cristy94 Yes, you are right. This is probably because we don't load `global_step`, so when you resume training, `global_step` starts again from 0. Your approach (switching the axis to "Wall") can work as a workaround.
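The conceptual fix is to persist the counter inside the checkpoint so a resumed run can pick it up. A minimal sketch, assuming a PyTorch-style training loop (the checkpoint keys `model`, `optimizer`, and `global_step` are my own illustration, not this repo's actual format):

```python
import torch

def save_checkpoint(model, optimizer, global_step, path):
    # Store the step counter alongside the weights so a resumed
    # run can continue counting from where the last run stopped.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "global_step": global_step,
    }, path)

def load_checkpoint(model, optimizer, path):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    # Fall back to 0 for older checkpoints that predate this field.
    return checkpoint.get("global_step", 0)
```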
After you train a model, say it reaches epoch 10 at 4000 iterations/epoch, i.e. 40000 iterations, and you stop training. When you resume, it loads the model weights but starts again from epoch 0, iteration 0. As a result, new checkpoints are (wrongly) saved as `snap-4000` and `snap-8000` instead of `snap-44000` and `snap-48000` (the total number of iterations the model has been trained for). Another problem is that the emitted events carry the wrong iteration numbers, so the graphs in TensorBoard get messed up.
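For illustration, a hedged sketch of what the resume path could look like once the counter is restored, reusing the numbers above (40000 steps already done, a snapshot every 4000 steps). `SummaryWriter` is PyTorch's TensorBoard writer; `compute_loss` is a hypothetical stand-in for the real training step:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/resumed")

# Resuming at epoch 10 with 4000 iterations/epoch -> 40000 steps done.
global_step = 40000  # in practice: restored via load_checkpoint(...) above

for _ in range(8000):  # two more epochs' worth of iterations
    global_step += 1
    loss = compute_loss()  # hypothetical; stands in for the real train step
    # Logging against the restored counter keeps the "Step" axis ordered
    # instead of overwriting steps 1..8000 of the previous run.
    writer.add_scalar("loss", loss, global_step=global_step)
    if global_step % 4000 == 0:
        # Snapshot names continue as snap-44000, snap-48000, ...
        # (the real code would call save_checkpoint from the sketch above)
        print(f"saving snap-{global_step}")
```

With the counter carried over, both the snapshot filenames and the TensorBoard "Step" axis stay monotonic across restarts, so the "Wall" workaround is no longer needed.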