Open marctessier opened 4 days ago
I was thinking about this issue. In order to get a training loss value, we need to run at least one batch, but once we have used up a batch we are no longer at step 0 for the resumed run. If we want to remove the gap, should we save the last losses so that, when we reload the model, we can send those saved values to TensorBoard to bridge the gap? (A rough sketch of this idea is below.)
Note that, when we resume, PyTorch Lightning actually performs one epoch of evaluation, records it to TensorBoard, and only then proceeds to resume training. If we were to use the losses calculated during that initial evaluation phase, we could get loss values at step 0, but they would most likely not align with the training loss values calculated at the end of the previous run, i.e. the run prior to resuming, i.e. the values of the last checkpoint we are currently resuming from.
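A minimal sketch of the "save the last losses and re-log them on resume" idea, assuming a standard PyTorch Lightning setup where the training loss is logged under the key "loss". The callback name and the checkpoint key `bridge_last_loss` are made up for illustration and are not part of the current code base:

```python
import pytorch_lightning as pl


class BridgeResumeGapCallback(pl.Callback):
    """Stash the last training loss in the checkpoint and re-log it when resuming."""

    def __init__(self):
        self.last_loss = None

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Remember the most recent training loss (the metric key is an assumption).
        loss = trainer.callback_metrics.get("loss")
        if loss is not None:
            self.last_loss = float(loss)

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Persist the last loss alongside the regular checkpoint contents.
        checkpoint["bridge_last_loss"] = self.last_loss

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        self.last_loss = checkpoint.get("bridge_last_loss")

    def on_train_start(self, trainer, pl_module):
        # When resuming, re-emit the saved loss at the current global step so
        # TensorBoard has a point where the new run begins.
        if self.last_loss is not None and trainer.global_step > 0:
            trainer.logger.log_metrics(
                {"loss": self.last_loss}, step=trainer.global_step
            )
```

If we went this way, the callback would be registered like any other, e.g. `Trainer(callbacks=[BridgeResumeGapCallback()])`.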
Bug description
When doing an FP fine-tune, TensorBoard makes it look like the next round starts 50 steps ahead.
See image
How to reproduce the bug
Error messages and logs
No error message.
Environment
Standard environment, nothing special. This will be relevant after PR #547 is merged (Resume at the end of the last trained epoch #547, issue #534).
More info
none