Quick note: The graphs can be somewhat fixed by changing the Horizontal Axis from "Step" to "Relative" or "Wall", since that orders the points by time instead of overlapping them.
@Cristy94 Yes, you are right. This is probably because we don't load `global_step`, so when you resume training, `global_step` starts again from 0. Your approach (switching the axis to "Wall") can work as a workaround.
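The conceptual fix is to persist the counter inside the checkpoint so a resumed run can pick it up. A minimal sketch, assuming a PyTorch-style training loop (the checkpoint keys `model`, `optimizer`, and `global_step` are my own illustration, not this repo's actual format):

```python
import torch

def save_checkpoint(model, optimizer, global_step, path):
    # Store the step counter alongside the weights so a resumed
    # run can continue counting from where the last run stopped.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "global_step": global_step,
    }, path)

def load_checkpoint(model, optimizer, path):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    # Fall back to 0 for older checkpoints that predate this field.
    return checkpoint.get("global_step", 0)
```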
After you train a model, say it reaches epoch 10 at 4000 iterations/epoch, i.e. 40000 iterations, and you stop training. When you resume, it loads the model weights but starts again from epoch 0, iteration 0. As a result, new checkpoints are (wrongly) saved as `snap-4000` and `snap-8000` instead of `snap-44000` and `snap-48000` (the total number of iterations the model has been trained for). Another problem is that the emitted events carry the wrong iteration numbers, so the graphs in TensorBoard get messed up.
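For illustration, a hedged sketch of what the resume path could look like once the counter is restored, reusing the numbers above (40000 steps already done, a snapshot every 4000 steps). `SummaryWriter` is PyTorch's TensorBoard writer; `compute_loss` is a hypothetical stand-in for the real training step:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/resumed")

# Resuming at epoch 10 with 4000 iterations/epoch -> 40000 steps done.
global_step = 40000  # in practice: restored via load_checkpoint(...) above

for _ in range(8000):  # two more epochs' worth of iterations
    global_step += 1
    loss = compute_loss()  # hypothetical; stands in for the real train step
    # Logging against the restored counter keeps the "Step" axis ordered
    # instead of overwriting steps 1..8000 of the previous run.
    writer.add_scalar("loss", loss, global_step=global_step)
    if global_step % 4000 == 0:
        # Snapshot names continue as snap-44000, snap-48000, ...
        # (the real code would call save_checkpoint from the sketch above)
        print(f"saving snap-{global_step}")
```

With the counter carried over, both the snapshot filenames and the TensorBoard "Step" axis stay monotonic across restarts, so the "Wall" workaround is no longer needed.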