coqui-ai / Trainer

🐸 - A general purpose model trainer, as flexible as it gets

Multiple bug fixes and add on_train_epoch_start callback #129

Closed · Edresson closed 11 months ago

Edresson commented 11 months ago

What does it do?

  1. Fixes the `KeyError: 'avg_loss_1'` error raised when `start_with_eval=True` and a `target_loss` is set. The error happens because `keep_avg_target.avg_values` is still an empty dictionary at that point in training; it is related to https://github.com/coqui-ai/TTS/issues/2862. The same error can also occur during training when we try to save a checkpoint before `self.keep_avg_train` or `self.keep_avg_eval` has been updated. To solve this, the PR also makes `_pick_target_avg_loss` safe, which prevents issues like https://github.com/coqui-ai/TTS/issues/1608 from happening: if `keep_avg_target.avg_values` is empty, it now returns `None` and all is well (see the first sketch after this list).
  2. Raises an error for a multiple-optimizer setup with gradient accumulation and no custom `optimize` method. This keeps users from training with our default implementation, which leaves dangling gradients in that configuration (I already did it accidentally, and it is really bad because we can lose training time). A sketch of the guard follows this list.
  3. Adds the `on_train_epoch_start` and `on_train_epoch_end` callbacks. Currently, the only way to put modules in eval mode during training is via the `on_train_step_start` callback, which is called on every train step and is therefore really slow. With the new callbacks we can do it only once per epoch, which should decrease the step time for XTTS GPT and XTTS decoder training (see the usage sketch at the end).
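
For item 1, here is a minimal sketch of the guarded lookup. The `KeepAverage` stand-in and the key-naming convention are simplified assumptions for illustration; only the `_pick_target_avg_loss` name, the `avg_values` dictionary, and the return-`None` behavior come from the PR description.

```python
from typing import Optional


class KeepAverage:
    """Toy stand-in: tracks running averages in `avg_values`."""

    def __init__(self):
        self.avg_values = {}


def _pick_target_avg_loss(keep_avg_target: KeepAverage, target_loss: Optional[str] = None):
    # Before the first train/eval step, `avg_values` is empty, so any keyed
    # lookup (e.g. avg_values["avg_loss_1"]) would raise KeyError. Returning
    # None here is the safety fix the PR describes.
    if not keep_avg_target.avg_values:
        return None
    if target_loss is not None:
        return keep_avg_target.avg_values.get("avg_" + target_loss)
    return keep_avg_target.avg_values.get("avg_loss")


# With start_with_eval=True the averages are still empty, and we now get
# None instead of a KeyError:
assert _pick_target_avg_loss(KeepAverage(), target_loss="loss_1") is None
```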
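For item 2, a hedged sketch of the guard. The argument names (`grad_accum_steps`, the list-of-optimizers convention) and the check for an overridable `model.optimize` are assumptions about the Trainer internals, not its exact code; only the condition being rejected comes from the PR.

```python
def check_grad_accum_setup(model, optimizer, grad_accum_steps: int) -> None:
    """Refuse multi-optimizer training with gradient accumulation unless the
    model implements its own `optimize` method."""
    uses_multiple_optimizers = isinstance(optimizer, (list, tuple)) and len(optimizer) > 1
    if uses_multiple_optimizers and grad_accum_steps > 1 and not hasattr(model, "optimize"):
        raise ValueError(
            "Gradient accumulation with multiple optimizers requires a custom "
            "`optimize` method on the model; the default optimization step "
            "leaves dangling gradients in this setup."
        )
```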
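For item 3, a usage sketch of the new epoch-level hooks, assuming the common pattern of defining callbacks as methods on the model; `frozen_module` is a hypothetical submodule (e.g. a frozen encoder) that must stay in eval mode while the rest of the model trains.

```python
import torch


class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.frozen_module = torch.nn.Linear(8, 8)  # stand-in for a frozen submodule

    def on_train_epoch_start(self, trainer) -> None:
        # Runs once per epoch, so the eval-mode switch no longer costs time
        # on every single train step (as on_train_step_start does).
        self.frozen_module.eval()

    def on_train_epoch_end(self, trainer) -> None:
        pass  # nothing to undo in this sketch
```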