To recreate: train a text-to-spec model, then add a finetune_ckpt value and resume training (you might need to change another model value like the max number of steps). the losses do not resume properly.

419 Observation

First run shows some overlap between the previous run (cyan) and resuming (pink). Did we miss that behavior previously or is it a new phenomenon?

A more detailed view of the overlap between first run and first resume.

It seems, again, that the global_step isn't consistent and changes during resuming. Resuming occurs at on_fit_start: ce=0 gs=0 and we see that prior gs=2208 which looks like the value in tensorboard then it goes back to gs=2000 which doesn't quite align to tensorboard's value.

2024-09-05 15:32:48.029 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_start: ce=2 gs=1472
2024-09-05 15:32:52.469 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_start: ce=2 gs=1500
2024-09-05 15:32:52.470 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_start: ce=2 gs=1500
2024-09-05 15:34:04.321 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_end: ce=2 gs=1500
2024-09-05 15:34:04.326 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_end: ce=2 gs=1500
2024-09-05 15:35:18.382 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_start: ce=2 gs=2000
2024-09-05 15:35:18.383 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_start: ce=2 gs=2000
2024-09-05 15:36:33.070 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_end: ce=2 gs=2000
2024-09-05 15:36:33.073 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_end: ce=2 gs=2000
2024-09-05 15:37:04.835 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_end: ce=2 gs=2208
2024-09-05 15:37:04.836 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_train_end: ce=3 gs=2208
2024-09-05 15:37:05.100 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_fit_end: ce=3 gs=2208
2024-09-05 15:37:48.261 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_fit_start: ce=0 gs=0
2024-09-05 15:37:48.532 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_train_start: ce=2 gs=2000
2024-09-05 15:37:48.533 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_start: ce=2 gs=2000
2024-09-05 15:37:48.544 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_start: ce=2 gs=2000
2024-09-05 15:37:48.545 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_start: ce=2 gs=2000
2024-09-05 15:38:55.279 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_end: ce=2 gs=2000
2024-09-05 15:38:55.282 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_end: ce=2 gs=2000
2024-09-05 15:39:28.536 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_end: ce=2 gs=2208
2024-09-05 15:39:28.537 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_start: ce=3 gs=2208
2024-09-05 15:40:11.124 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_start: ce=3 gs=2500
2024-09-05 15:40:11.125 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_start: ce=3 gs=2500
2024-09-05 15:41:22.792 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_end: ce=3 gs=2500
2024-09-05 15:41:22.795 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_validation_end: ce=3 gs=2500
2024-09-05 15:42:27.431 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_end: ce=3 gs=2944
2024-09-05 15:42:27.432 | DEBUG    | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_start: ce=4 gs=2944

While tracing the code while resuming from a checkpoint, we should be getting this warning because is_resumable_loader == False and self.global_step == 500 & `expected_steps == 736'

reset [_TrainingEpochLoop] training_epoch_loop.py:161

  if self.global_step % expected_steps != 0 and not is_resumable_loader:
    rank_zero_warn(
      "You're resuming from a checkpoint that ended before the epoch ended and your dataloader is"
      " not resumable. This can cause unreliable results if further training is done."
      " Consider using an end-of-epoch checkpoint or make your dataloader resumable by implementing"
      " the `state_dict` / `load_state_dict` interface.",
      category=PossibleUserWarning,
    )

247

Update: When you know what to look for you can find it in your log.

The following graphs are comprised of two separate runs. Both curves are using the potential fix to this issue which is to set save_on_train_epoch_end=True when instantiating last_ckpt_callback = ModelCheckpoint(). Otherwise, the framework doesn't save a checkpoint at the end of an epoch thus when we resume, we actually resume from a checkpoint that is earlier which causes the staggering of runs.

The orange run has val_check_interval = 0.2 and check_val_every_n_epoch = 1.
The blue run has val_check_interval = 500 and check_val_every_n_epoch = None/null.

One more point to notice, is that there are gaps when we restart and those have to do with the fact that lightning doesn't record the value of train a step 0 when we restart but rather delays it after a couple of updates.

Looking at the following graph, we see that the blue curves are sometimes dotted because during that run/resume, one a single validation was performed.

Since we are performing a single epoch at a time, we can now see clear distinction between epoch

Looking at the previous state of resuming, we can see that each run were spanning multiple epochs which shouldn't have happened in the current run mode of one epoch at a time. We see the spanning across multiple epochs because each colored line are not simply a straight horizontal line but rather are "S" shape-like. One oddity we can also see is, how come the purple line, which represent an epoch, isn't the same length as the other epoch? For that matter, how come the epochs are not all of the same length?

Test Code

#!/bin/bash
# vim:nowrap:
#
#SBATCH --job-name=resume-EV
#SBATCH --partition=gpu_a100
#SBATCH --account=nrc_ict__gpu_a100
#SBATCH --gres=gpu:1
##SBATCH --partition=standard
##SBATCH --account=nrc_ict
#SBATCH --qos=low
#SBATCH --time=720

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=16G

##SBATCH --comment="image=nrc/nrc_all_default_ubuntu-22.04-amd64_latest"
##SBATCH --comment="image=registry.maze-c.collab.science.gc.ca/sschpcs/generic-job:ubuntu22.04_master"

#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
#SBATCH --open-mode=append
##SBATCH --mail-user==sam037
##SBATCH --mail-type=NONE

#SBATCH --signal=B:15@30

## Priority:
#$ low
## Resources:
#
#

function downsize {
   # Given a new-project, change some default values to make everyvoice faster to test.
   sed \
      -i \
      -e '/use_postnet/ s/: .*/: false/' \
      -e '/batch_size/ s/: .*/: 5/' \
      -e '/save_top_k_ckpts/ s/: .*/: 25/' \
      -e '/check_val_every_n_epoch/  s/: .*/: 1/' \
      -e '/val_check_interval/  s/: .*/: 1.0/' \
      config/*
}

ulimit -v unlimited

readonly step_size=1
readonly version=${1:?logger version name?}

# Load up EveryVoice's environment.
source /space/partner/nrc/work/dt/sgile/opt/miniconda3/bin/activate ""
conda activate EveryVoice.sl

ulimit -v unlimited

readonly step_size=1
readonly version=${1:?logger version name?}

# Load up EveryVoice's environment.
source /space/partner/nrc/work/dt/sgile/opt/miniconda3/bin/activate ""
conda activate EveryVoice.sl

{
   # Log some information for post-run debugging.
   head -n 31231231 $0  # Log this script in case we make modificatons between runs
   ( set -o posix; set ) >&2  # What was the shell environment like?
   conda env export   # What was the conda environment like?

   #downsize
   head -n 12312312 config/*.yaml  # What was the configurations that were used?

   ( cd ~/sam037/git/EveryVoice && git diff main..HEAD; )
} &> $SLURM_JOB_NAME-$SLURM_JOB_ID.bookkeeping

echo "========== PREPROCESSING ==========" >&2
[[ -s preprocessed/filelist.psv ]] ||
   srun everyvoice preprocess config/everyvoice-text-to-spec.yaml --cpus $SLURM_CPUS_PER_TASK

echo "========== RESUMING TRAINING FEATURE PREDICTION ==========" >&2
set -o errexit
#   --config-args training.attn_bin_loss_weight=0
# NOTE: if the finetune_checkpoint doesn't exist, training starts from scratch.
#for current_epoch in {0..15..3}; do
for current_epoch in {0..5..1}; do
   # TODO: record each run into its own log + global log
   previous_epoch=$(($current_epoch - 1))
   echo "Training up to $current_epoch"
   model_ckpt="logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/last.ckpt.$previous_epoch"
   sha1sum $model_ckpt || true

   srun everyvoice train text-to-spec \
      config/everyvoice-text-to-spec.yaml \
      --config-args training.finetune_checkpoint="$model_ckpt" \
      --config-args training.max_epochs=$((current_epoch+$step_size)) \
      --config-args training.logger.version=$version

   mv "logs_and_checkpoints/FeaturePredictionExperiment/$version/metrics.csv" \
      "logs_and_checkpoints/FeaturePredictionExperiment/$version/metrics.csv.$current_epoch" \
   || true

   ls -l logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/*
   cp "logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/last.ckpt" \
      "logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/last.ckpt.$current_epoch"
   sha1sum logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/*
done

Let's make sure we always save at the end of a training epoch by adding save_on_train_epoch_end=True

    # This callback will always save the last checkpoint
    # regardless of its performance.
    last_ckpt_callback = ModelCheckpoint(
        save_top_k=1,
        save_last=True,
        every_n_train_steps=config.training.ckpt_steps,
        every_n_epochs=config.training.ckpt_epochs,
        enable_version_counter=True,
        save_on_train_epoch_end=True,
    )

ModelCheckpoint

save_last (Union[bool, Literal['link'], None]) – When True, saves a last.ckpt copy whenever a checkpoint file gets saved. Can be set to 'link' on a local filesystem to create a symbolic link. This allows accessing the latest checkpoint in a deterministic manner. Default: None.
save_top_k (int) – if save_top_k == k, the best k models according to the quantity monitored will be saved. If save_top_k == 0, no models are saved. If save_top_k == -1, all models are saved. Please note that the monitors are checked every every_n_epochs epochs. If save_top_k >= 2 and the callback is called multiple times inside an epoch, and the filename remains unchanged, the name of the saved file will be appended with a version count starting with v1 to avoid collisions unless enable_version_counter is set to False. The version counter is unrelated to the top-k ranking of the checkpoint, and we recommend formatting the filename to include the monitored metric to avoid collisions.
save_on_train_epoch_end (Optional[bool]) – Whether to run checkpointing at the end of the training epoch. If this is False, then the check runs at the end of the validation.
every_n_epochs (Optional[int]) – Number of epochs between checkpoints. This value must be None or non-negative. To disable saving top-k checkpoints, set every_n_epochs = 0. This argument does not impact the saving of save_last=True checkpoints. If all of every_n_epochs, every_n_train_steps and train_time_interval are None, we save a checkpoint at the end of every epoch (equivalent to every_n_epochs = 1). If every_n_epochs == None and either every_n_train_steps != None or train_time_interval != None, saving at the end of each epoch is disabled (equivalent to every_n_epochs = 0). This must be mutually exclusive with every_n_train_steps and train_time_interval. Setting both ModelCheckpoint(..., every_n_epochs=V, save_on_train_epoch_end=False) and Trainer(max_epochs=N, check_val_every_n_epoch=M) will only save checkpoints at epochs 0 < E <= N where both values for every_n_epochs and check_val_every_n_epoch evenly divide E.

EveryVoiceTTS / EveryVoice

Resuming training for text-to-spec model does not resume with appropriate losses #534

419

Observation

247

Test Code