EveryVoiceTTS / EveryVoice

The EveryVoice TTS Toolkit - Text To Speech for your language
https://docs.everyvoice.ca

Resume at the end of the last trained epoch #547

Closed SamuelLarkin closed 2 weeks ago

SamuelLarkin commented 2 weeks ago

PR Goal?

Fix proper resuming of text-to-spec training. The state at the end of the last epoch wasn't saved, so resuming restarted from the last saved checkpoint, which was the last checkpoint written for validation. This produced staggered runs, as shown in tensorboard.
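The staggering can be illustrated with plain arithmetic: if checkpoints are only written every `val_check_interval` steps (500 by default) and an epoch is about 736 batches, resuming restarts from step 500 while the logs already extend to step 736, so the resumed run re-logs steps 500–736 on top of the first run. A minimal sketch of that reasoning (pure Python; the 736 and 500 figures are taken from the test scenario below, not from EveryVoice code):

```python
# Illustrate why resuming from the last *validation* checkpoint staggers runs.
# Numbers mirror the test scenario in this PR: ~736 batches/epoch,
# default val_check_interval of 500 (assumptions for illustration only).
BATCHES_PER_EPOCH = 736
VAL_CHECK_INTERVAL = 500

# Without an end-of-epoch save, the newest checkpoint is the last
# validation-time save, i.e. the largest multiple of 500 within the epoch.
last_val_checkpoint_step = (BATCHES_PER_EPOCH // VAL_CHECK_INTERVAL) * VAL_CHECK_INTERVAL
print(last_val_checkpoint_step)  # 500

# Resuming replays steps 500..736, overlapping what tensorboard already shows.
overlap = BATCHES_PER_EPOCH - last_val_checkpoint_step
print(overlap)  # 236

# With an end-of-epoch checkpoint, resuming starts exactly at step 736: no overlap.
last_epoch_checkpoint_step = BATCHES_PER_EPOCH
print(BATCHES_PER_EPOCH - last_epoch_checkpoint_step)  # 0
```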

Fixes?

#534

Feedback sought?

merge approval

Priority?

low

Tests added?

None

How to test?

   srun everyvoice train text-to-spec \
      config/everyvoice-text-to-spec.yaml \
      --config-args training.max_epochs=1

Check the state of the loops

python -c '
import json

import torch

ckpt = torch.load(
    "logs_and_checkpoints/FeaturePredictionExperiment/save_on_train_epoch_end/checkpoints/last.ckpt",
    map_location=torch.device("cpu"),
)
print(json.dumps(ckpt["loops"]["fit_loop"]["epoch_loop.batch_progress"], indent=2))
'

This will yield something like the following; look at the values under current. This run used 11790 training examples split into batches of 16, so one epoch is 11790/16 ≈ 736 batches. If we instead see 500, the default val_check_interval, it means we did not save at the end of the epoch.

{
  "total": {
    "ready": 4421,
    "completed": 4421,
    "started": 4421,
    "processed": 4421
  },
  "current": {
    "ready": 736,
    "completed": 736,
    "started": 736,
    "processed": 736
  },
  "is_last_batch": true
}
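The 736 figure follows from integer arithmetic; whether the final partial batch is dropped is an assumption here, but floor division is what matches the counters above:

```python
import math

num_examples = 11790  # training examples in this run
batch_size = 16

# If the dataloader drops the final partial batch (drop_last=True),
# an epoch is the floor of the division -- matching the 736 above.
batches_dropping_last = num_examples // batch_size
print(batches_dropping_last)  # 736

# Keeping the partial batch would add one extra step per epoch.
batches_keeping_last = math.ceil(num_examples / batch_size)
print(batches_keeping_last)  # 737
```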

Try resuming for a second epoch.

   srun everyvoice train text-to-spec \
      config/everyvoice-text-to-spec.yaml \
      --config-args training.finetune_checkpoint="logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt" \
      --config-args training.max_epochs=2

Use tensorboard and check that the second run's training is NOT staggered with your first run.

tensorboard --port=2024 --logdir=logs_and_checkpoints  --bind_all
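To make "not staggered" concrete: the first run logs roughly steps 1–736 and a correct resume continues from step 737, so the two runs' step ranges are disjoint rather than overlapping. A throwaway sketch of that expectation (the 736 batches-per-epoch figure comes from the example above, not from an EveryVoice API):

```python
batches_per_epoch = 736  # from the run above: 11790 examples, batch size 16

# Steps logged by the first run (epoch 1) and a correctly resumed second run (epoch 2).
run1_steps = set(range(1, batches_per_epoch + 1))
run2_steps = set(range(batches_per_epoch + 1, 2 * batches_per_epoch + 1))

# Disjoint ranges mean the tensorboard curves continue end-to-end instead of overlapping.
print(run1_steps.isdisjoint(run2_steps))  # True
print(min(run2_steps))  # 737: where the resumed run should pick up
```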

Confidence?

Good

Version change?

No

Related PRs?

None

semanticdiff-com[bot] commented 2 weeks ago

Review changes with SemanticDiff.

Analyzed 1 of 1 files.

| Filename | Status |
|---|---|
| everyvoice/base_cli/helpers.py | :heavy_check_mark: Analyzed |
github-actions[bot] commented 2 weeks ago
CLI load time: 0:00.23
Pull Request HEAD: 7cce58cb74a59ca919153ce22f72e49f4ee64024
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package
codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 74.63%. Comparing base (3a36240) to head (7cce58c). Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #547   +/-   ##
=======================================
  Coverage   74.63%   74.63%
=======================================
  Files          46       46
  Lines        3130     3130
  Branches      510      510
=======================================
  Hits         2336     2336
  Misses        693      693
  Partials      101      101
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

marctessier commented 2 weeks ago

Yes, confirming that the fine-tune checkpoint is resuming from the end of the previous run (50 steps ahead), versus how it was definitely overlapping before.

I will open a new ticket for the 50-step offset but will close this one since it is now resolved. :-)