What it does?
Solves the `KeyError: 'avg_loss_1'` error raised when `start_with_eval=True` and `target_loss` is set. It happens because `keep_avg_target.avg_values` is still an empty dictionary; see https://github.com/coqui-ai/TTS/issues/2862. The same error can also appear during training when a checkpoint is saved before `self.keep_avg_train` or `self.keep_avg_eval` has been updated. To solve it, this PR also makes `_pick_target_avg_loss` safe, which avoids issues like https://github.com/coqui-ai/TTS/issues/1608: if `keep_avg_target.avg_values` is empty, it returns `None` and everything works as expected.
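For illustration, a minimal sketch of the safer behaviour, written here as a standalone function (the real change lives in the Trainer's `_pick_target_avg_loss` method, whose exact signature may differ):

```python
def pick_target_avg_loss(keep_avg_target, target_loss):
    """Sketch of the guard described above (not the exact Trainer code)."""
    # Nothing accumulated yet, e.g. start_with_eval=True or a checkpoint
    # saved before keep_avg_train / keep_avg_eval was updated: return None
    # instead of indexing into an empty dict and raising KeyError.
    if not keep_avg_target.avg_values:
        return None
    # Key layout follows the error above: a target_loss of "loss_1" is
    # stored under the key "avg_loss_1".
    return keep_avg_target.avg_values["avg_" + target_loss]
```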
It also raises an error for a multiple-optimizer setup that uses gradient accumulation without a custom `optimize` method. This prevents users from training with our default implementation, which leaves dangling gradients in that scenario (I already did it accidentally; it is really bad because training time can be lost).
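A rough sketch of the kind of check that catches this early; the helper name and arguments below are assumptions for illustration, not the Trainer's actual internals:

```python
def assert_grad_accum_is_safe(model, optimizers, grad_accum_steps):
    """Hypothetical helper showing the new safety check."""
    multiple_optimizers = isinstance(optimizers, (list, tuple)) and len(optimizers) > 1
    has_custom_optimize = hasattr(model, "optimize")
    if multiple_optimizers and grad_accum_steps > 1 and not has_custom_optimize:
        raise ValueError(
            "Gradient accumulation with multiple optimizers requires a custom "
            "`optimize()` method; the default path leaves dangling gradients."
        )
```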
It also adds `on_train_epoch_start` and `on_train_epoch_end` callbacks. Currently, the only way to put modules of the model in eval mode during training is the `on_train_step_start` callback, which is called on every train step and is really slow. With the new callbacks this can be done only once per epoch, which should decrease the step time for XTTS GPT and XTTS decoder training.
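A usage sketch, assuming the new callbacks are implemented on the model with a `(self, trainer)` signature like the existing step callbacks; the module names are made up for illustration:

```python
import torch

class MyTTSModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.frozen_encoder = torch.nn.Linear(512, 512)  # stand-in for a frozen module
        self.decoder = torch.nn.Linear(512, 512)

    def on_train_epoch_start(self, trainer):
        # Put the frozen part in eval mode once per epoch instead of on
        # every train step (as on_train_step_start would require).
        self.frozen_encoder.eval()

    def on_train_epoch_end(self, trainer):
        # Optional per-epoch cleanup hook.
        pass
```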