Open Itok2000u opened 2 years ago
It's not a bug in the code, but an artifact of that particular run. In your log, after step 204000 no actual training step (weight update) takes place, and skipped_steps increases from 1 upward (up to 3 as far as I can see).
On the very next weight update the model makes (a step that is not skipped), it would save the checkpoint.
It could be that your run has diverged, and the model will only keep skipping steps until the loss scaler falls to 0.0, eventually resulting in a failure.
In fact, the training has not terminated. The model weights continue to be updated. The following screenshot shows part of the log, confirming that training does not end.
As you can see, the loss decreases from 1.7 to 1.5, and the loss scaler does not fall either.
Whenever the skipped_steps count increases, the loss scaler is divided by 2. As long as skipped_steps stays constant while the Training_Iteration count increases, training is progressing, as you have observed. Checkpoints will be saved as the Training_Iteration count increases.
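The scaler behavior described in the two comments above can be sketched as follows. This is a minimal illustration of dynamic loss scaling as explained in this thread (the class and parameter names are hypothetical, not the actual Bert/Pytorch training code): on a gradient overflow the weight update is skipped, skipped_steps is incremented, and the scale is halved; repeated halving is the path toward the 0.0 scale and the failure warned about above.

```python
class DynamicLossScaler:
    """Hypothetical sketch of the dynamic loss scaling logic described in this thread."""

    def __init__(self, init_scale=2.0 ** 16):
        self.scale = init_scale
        self.skipped_steps = 0

    def step(self, grads_overflowed):
        """Return True if it is safe to apply the weight update this iteration."""
        if grads_overflowed:
            # Skip the weight update and halve the scale; with enough
            # consecutive skips the scale eventually underflows toward 0.0,
            # the divergence failure mode described above.
            self.skipped_steps += 1
            self.scale = self.scale / 2
            return False
        return True

scaler = DynamicLossScaler(init_scale=8.0)
updates = [scaler.step(ovf) for ovf in [False, True, True, False]]
# updates -> [True, False, False, True]; scaler.scale -> 2.0; skipped_steps -> 2
```

Note that real implementations also grow the scale back after a window of overflow-free steps; that part is omitted here because the thread only discusses the halving side.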
Related to Bert/Pytorch
Describe the bug After running for a long time, for example 200,000 iterations, some steps get skipped. These skipped steps are counted into the total step count, so the step count never again becomes a multiple of save_checkpoint_steps.
For example, I set save_checkpoint_steps = 1000, which means a checkpoint is saved every 1000 iterations. However, after about 200,000 iterations a step was skipped, which left the total step count at a number like 200,999. From that point on, the program stops saving checkpoints.
To Reproduce Steps to reproduce the behavior:
Some screenshot For the above screenshot, please notice that after iter 204000 no checkpoint has been saved. For the above code, please note the behavior of the variable dynamic_optimizer_step.
Environment Please provide at least: