JiahuiYu / generative_inpainting

DeepFill v1/v2 with Contextual Attention and Gated Convolution, CVPR 2018, and ICCV 2019 Oral
http://jiahuiyu.com/deepfill/

[Logging] Epoch count/train iterations count are reset when resuming training #466

Closed Cristy94 closed 3 years ago

Cristy94 commented 4 years ago

Suppose you train a model until it reaches epoch 10 at 4,000 iterations per epoch, so 40,000 iterations in total, and then you stop training.

When you resume training, it loads the model but starts again from epoch 0, iteration 0. This makes it so that new checkpoints are (wrongly) saved as snap-4000 and snap-8000 instead of snap-44000 and snap-48000 (the total number of iterations the model has actually been trained for). Another problem is that the emitted events carry the wrong iteration numbers, so the graphs in TensorBoard get messed up.


Cristy94 commented 4 years ago

Quick note: the graphs can be somewhat fixed by changing the Horizontal Axis from "Step" to "Relative" or "Wall", as that correctly orders the points instead of overlapping them.

JiahuiYu commented 3 years ago

@Cristy94 Yes, you are right. This is probably because we didn't load global_step, so when you re-train, the global_step starts again from 0. Your approach (switching to "Wall") can work as a workaround.
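Since the checkpoints already encode the iteration count in their filenames (e.g. snap-40000), one possible workaround is to recover the last global step from the checkpoint directory and use it as the starting iteration when resuming, so new snapshots and TensorBoard events continue from the right step. A minimal sketch, assuming the repo's `snap-<iteration>` naming convention; the helper name `resume_step_from_checkpoint` is hypothetical, not part of the repo:

```python
import os
import re


def resume_step_from_checkpoint(checkpoint_dir):
    """Hypothetical helper: recover the global step from the newest
    'snap-<iteration>' checkpoint file so training can resume counting
    from there instead of restarting at 0. Returns 0 for a fresh run."""
    steps = []
    for name in os.listdir(checkpoint_dir):
        # Matches e.g. 'snap-40000.index', 'snap-40000.meta', 'snap-40000'.
        m = re.match(r"snap-(\d+)", name)
        if m:
            steps.append(int(m.group(1)))
    return max(steps, default=0)
```

The returned value could then be assigned to the `global_step` variable (or added as an offset to the local iteration counter) before the training loop starts, so that the next checkpoint after resuming from snap-40000 is saved as snap-44000 rather than snap-4000.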