EveryVoiceTTS / EveryVoice

The EveryVoice TTS Toolkit - Text To Speech for your language
https://docs.everyvoice.ca

When doing an FP fine-tune, in TensorBoard the next round starts 50 steps ahead. #550

Open marctessier opened 4 days ago

marctessier commented 4 days ago

Bug description

When doing an FP fine-tune, in TensorBoard it looks like the next round starts 50 steps ahead.

See attached screenshot: Screenshot 2024-09-18 at 09:59:35

How to reproduce the bug

Run an FP training. One epoch is enough to see the issue.

srun everyvoice train text-to-spec config/everyvoice-text-to-spec.yaml --config-args training.max_epochs=1

Then fine-tune that job with an extra epoch.

srun everyvoice train text-to-spec config/everyvoice-text-to-spec.yaml --config-args training.max_epochs=2 --config-args training.finetune_checkpoint="logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt"

Error messages and logs

No error message.

Environment

Standard ENV, nothing special. This will apply once PR #547 is merged. (Resume at the end of the last trained epoch #547, Issue #534)

More info

none

SamuelLarkin commented 3 days ago

I was thinking about this issue. In order to get a training loss value, we need to run at least one batch, but once we use up one batch we are no longer at step 0 for that resumed run. If we want to remove the gap, we could save the last losses; then, when we reload the model, we could send those saved values to TensorBoard to bridge the gap.
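
A minimal sketch of that idea, assuming a PyTorch Lightning `Callback` and the TensorBoard logger; the class name, the `last_losses` checkpoint key, and the metric names are hypothetical, not EveryVoice's actual names:

```python
import pytorch_lightning as pl


class BridgeLossGap(pl.Callback):
    """Stash the last logged training losses in the checkpoint and re-emit
    them to TensorBoard at the resumed global step, so the new curve starts
    where the previous run ended instead of ~50 steps later."""

    LOSS_KEYS = ("training/total_loss",)  # hypothetical metric names

    def __init__(self):
        self._restored_losses = {}

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Save the most recent loss values alongside the model weights.
        checkpoint["last_losses"] = {
            k: float(v)
            for k, v in trainer.callback_metrics.items()
            if k in self.LOSS_KEYS
        }

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        # Recover the saved loss values when resuming from the checkpoint.
        self._restored_losses = checkpoint.get("last_losses", {})

    def on_train_start(self, trainer, pl_module):
        # Re-log the saved values at the resumed step to bridge the gap.
        writer = trainer.logger.experiment  # SummaryWriter for TensorBoardLogger
        for name, value in self._restored_losses.items():
            writer.add_scalar(name, value, global_step=trainer.global_step)
        self._restored_losses = {}
```

It would have to be registered like any other callback, e.g. `Trainer(callbacks=[BridgeLossGap()])`, in both the original run and the fine-tuning run so the values are saved and then replayed.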

Note that, when we resume, pytorch lightning actually performs one epoch of evaluation then records it to tensorboard then actually proceed to resume training. If we were to use the losses calculated during that first evaluation phase, we could get losses' value at step 0 but they would most likely not align with the training losses' value calculated at the end of the final run aka the run that is prior to resuming aka the values of the last checkpoint we are currently resuming from.