NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[Bert/Pytorch] During pretraining, checkpoints won't be saved automatically. #1159

Open Itok2000u opened 2 years ago

Itok2000u commented 2 years ago

Related to Bert/Pytorch

Describe the bug
After running for a long period, for example 200,000 iterations, some steps get skipped. These skipped steps are counted into the total step count, so the total number of steps never becomes a multiple of save_checkpoint_steps.

For example, I set save_checkpoint_steps = 1000, which means a checkpoint should be saved every 1000 iterations. However, after I had run 200,000 iterations, a step was skipped, which made the total step count something like 200,999. From that point on, the program stops saving checkpoints.
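
To make the interaction concrete, here is a minimal, self-contained sketch of the kind of bookkeeping being described. It is not the repository's actual training loop; the save condition, the counter names (training_iteration, skipped_steps, dynamic_optimizer_step), and the simulated overflow are illustrative assumptions.

```python
# Illustrative simulation only -- not the repository's code.
# Shows how a skipped step can interact with a "save every N steps" rule.

save_checkpoint_steps = 5          # small value so the effect is easy to see
overflow_iterations = {10}         # pretend iteration 10 hits a loss-scale overflow

training_iteration = 0             # counts every pass through the loop
skipped_steps = 0                  # counts passes where the optimizer step was skipped

for _ in range(25):
    training_iteration += 1

    if training_iteration in overflow_iterations:
        skipped_steps += 1         # no weight update, no checkpoint this pass
        continue

    # A real weight update happened; "dynamic_optimizer_step" here is just
    # the number of updates so far (name borrowed from the screenshot).
    dynamic_optimizer_step = training_iteration - skipped_steps

    # If the save condition is keyed to the raw iteration count, the save
    # scheduled for iteration 10 is silently lost because that pass was skipped.
    if training_iteration % save_checkpoint_steps == 0:
        print(f"iter {training_iteration}: checkpoint saved "
              f"(weight updates so far: {dynamic_optimizer_step})")
```

Running this prints saves at iterations 5, 15, 20 and 25; the save that would have happened at iteration 10 is lost because that iteration was skipped.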

To Reproduce
Steps to reproduce the behavior:

  1. Configure the run so that a step gets skipped (e.g., via a loss-scale overflow).

Screenshots
For the first screenshot, please notice that after iteration 204000 no checkpoint has been saved. For the second screenshot (code), please notice the behavior of the variable called dynamic_optimizer_step.

Environment
Please provide at least:

sharathts commented 2 years ago

It's not a bug in the code, but an artifact of what is happening in your particular run.

In your log, after step 204000 no actual training step (weight update) has happened, and skipped_steps increases from 1 upward (up to 3, as far as I can see).

At the very next weight update the model makes (i.e., a step that is not skipped), it will save the checkpoint.

It could also be that your run has diverged and the model will keep skipping steps until the loss scale falls to 0.0, eventually resulting in a failure.
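
For background, dynamic loss scaling of the kind referred to here typically halves the scale and skips the weight update whenever the scaled gradients overflow, and only grows the scale back after a run of clean steps. The sketch below is a generic illustration with assumed constants, not the exact scaler used in this repository.

```python
# Generic dynamic-loss-scaling behaviour (illustrative; constants are assumptions).

class ToyLossScaler:
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, found_overflow):
        """Return True if the optimizer step should be applied."""
        if found_overflow:
            # Overflow in the scaled gradients: skip this step and halve
            # the scale -- this is what shows up as skipped_steps in the log.
            self.scale /= 2.0
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps % self.growth_interval == 0:
            self.scale *= 2.0      # cautiously grow back after a clean streak
        return True


# A diverged run that overflows on every step keeps halving the scale,
# heading toward 0.0 -- the failure mode described above.
scaler = ToyLossScaler()
for _ in range(100):
    scaler.update(found_overflow=True)
print(scaler.scale)                # already a tiny fraction of the initial scale
```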

Itok2000u commented 2 years ago

In fact, the training has not terminated. The weights in the model continue to be updated. The following excerpt from the log shows that training does not end.

As you can see, the loss decreases from 1.7 to 1.5, and there is no drop in the loss scale either.

sharathts commented 2 years ago

Whenever the skipped_steps count increases, the loss scale gets divided by 2. Training is progressing as long as skipped_steps does not increase while the Training_Iteration step count does, as you have observed. Checkpoints will be saved as the Training_Iteration count increases.
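
In other words, a quick way to read the log is: between two consecutive readings, training is healthy if Training_Iteration advanced while skipped_steps stayed flat. A tiny illustration (field names are assumptions, not the log's exact format):

```python
# Illustrative progress check; field names are assumptions, not the log format.

def is_progressing(prev, curr):
    """True if weight updates happened between two log readings."""
    return (curr["training_iteration"] > prev["training_iteration"]
            and curr["skipped_steps"] == prev["skipped_steps"])

prev = {"training_iteration": 204000, "skipped_steps": 1}
curr = {"training_iteration": 204050, "skipped_steps": 1}
print(is_progressing(prev, curr))   # True -> checkpoints will be saved again
```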