Open roedoejet opened 1 month ago
First run shows some overlap between the previous run (cyan) and resuming (pink). Did we miss that behavior previously or is it a new phenomenon?
A more detailed view of the overlap between first run and first resume.
It seems, again, that the global_step
isn't consistent and changes during resuming. Resuming occurs at on_fit_start: ce=0 gs=0
and we see that prior gs=2208
which looks like the value in tensorboard
then it goes back to gs=2000
which doesn't quite align to tensorboard
's value.
2024-09-05 15:32:48.029 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_start: ce=2 gs=1472
2024-09-05 15:32:52.469 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_start: ce=2 gs=1500
2024-09-05 15:32:52.470 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_start: ce=2 gs=1500
2024-09-05 15:34:04.321 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_end: ce=2 gs=1500
2024-09-05 15:34:04.326 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_end: ce=2 gs=1500
2024-09-05 15:35:18.382 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_start: ce=2 gs=2000
2024-09-05 15:35:18.383 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_start: ce=2 gs=2000
2024-09-05 15:36:33.070 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_end: ce=2 gs=2000
2024-09-05 15:36:33.073 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_end: ce=2 gs=2000
2024-09-05 15:37:04.835 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_end: ce=2 gs=2208
2024-09-05 15:37:04.836 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_train_end: ce=3 gs=2208
2024-09-05 15:37:05.100 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_fit_end: ce=3 gs=2208
2024-09-05 15:37:48.261 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_fit_start: ce=0 gs=0
2024-09-05 15:37:48.532 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_train_start: ce=2 gs=2000
2024-09-05 15:37:48.533 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_start: ce=2 gs=2000
2024-09-05 15:37:48.544 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_start: ce=2 gs=2000
2024-09-05 15:37:48.545 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_start: ce=2 gs=2000
2024-09-05 15:38:55.279 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_end: ce=2 gs=2000
2024-09-05 15:38:55.282 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_end: ce=2 gs=2000
2024-09-05 15:39:28.536 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_end: ce=2 gs=2208
2024-09-05 15:39:28.537 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_start: ce=3 gs=2208
2024-09-05 15:40:11.124 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_start: ce=3 gs=2500
2024-09-05 15:40:11.125 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_start: ce=3 gs=2500
2024-09-05 15:41:22.792 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_epoch_end: ce=3 gs=2500
2024-09-05 15:41:22.795 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_validation_end: ce=3 gs=2500
2024-09-05 15:42:27.431 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_end: ce=3 gs=2944
2024-09-05 15:42:27.432 | DEBUG | everyvoice.base_cli.callback:log_debug:47 - on_train_epoch_start: ce=4 gs=2944
While tracing the code while resuming from a checkpoint, we should be getting this warning because is_resumable_loader == False
and self.global_step == 500
& `expected_steps == 736'
reset [_TrainingEpochLoop] training_epoch_loop.py:161
if self.global_step % expected_steps != 0 and not is_resumable_loader:
rank_zero_warn(
"You're resuming from a checkpoint that ended before the epoch ended and your dataloader is"
" not resumable. This can cause unreliable results if further training is done."
" Consider using an end-of-epoch checkpoint or make your dataloader resumable by implementing"
" the `state_dict` / `load_state_dict` interface.",
category=PossibleUserWarning,
)
Update: When you know what to look for you can find it in your log.
The following graphs are comprised of two separate runs.
Both curves are using the potential fix to this issue which is to set save_on_train_epoch_end=True
when instantiating last_ckpt_callback = ModelCheckpoint()
. Otherwise, the framework doesn't save a checkpoint at the end of an epoch thus when we resume, we actually resume from a checkpoint that is earlier which causes the staggering of runs.
val_check_interval = 0.2
and check_val_every_n_epoch = 1
.val_check_interval = 500
and check_val_every_n_epoch = None/null
.One more point to notice, is that there are gaps when we restart and those have to do with the fact that lightning doesn't record the value of train a step 0 when we restart but rather delays it after a couple of updates.
Looking at the following graph, we see that the blue curves are sometimes dotted because during that run/resume, one a single validation was performed.
Since we are performing a single epoch at a time, we can now see clear distinction between epoch
Looking at the previous state of resuming, we can see that each run were spanning multiple epochs which shouldn't have happened in the current run mode of one epoch at a time. We see the spanning across multiple epochs because each colored line are not simply a straight horizontal line but rather are "S" shape-like. One oddity we can also see is, how come the purple line, which represent an epoch, isn't the same length as the other epoch? For that matter, how come the epochs are not all of the same length?
#!/bin/bash
# vim:nowrap:
#
#SBATCH --job-name=resume-EV
#SBATCH --partition=gpu_a100
#SBATCH --account=nrc_ict__gpu_a100
#SBATCH --gres=gpu:1
##SBATCH --partition=standard
##SBATCH --account=nrc_ict
#SBATCH --qos=low
#SBATCH --time=720
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=16G
##SBATCH --comment="image=nrc/nrc_all_default_ubuntu-22.04-amd64_latest"
##SBATCH --comment="image=registry.maze-c.collab.science.gc.ca/sschpcs/generic-job:ubuntu22.04_master"
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
#SBATCH --open-mode=append
##SBATCH --mail-user==sam037
##SBATCH --mail-type=NONE
#SBATCH --signal=B:15@30
## Priority:
#$ low
## Resources:
#
#
function downsize {
# Given a new-project, change some default values to make everyvoice faster to test.
sed \
-i \
-e '/use_postnet/ s/: .*/: false/' \
-e '/batch_size/ s/: .*/: 5/' \
-e '/save_top_k_ckpts/ s/: .*/: 25/' \
-e '/check_val_every_n_epoch/ s/: .*/: 1/' \
-e '/val_check_interval/ s/: .*/: 1.0/' \
config/*
}
ulimit -v unlimited
readonly step_size=1
readonly version=${1:?logger version name?}
# Load up EveryVoice's environment.
source /space/partner/nrc/work/dt/sgile/opt/miniconda3/bin/activate ""
conda activate EveryVoice.sl
ulimit -v unlimited
readonly step_size=1
readonly version=${1:?logger version name?}
# Load up EveryVoice's environment.
source /space/partner/nrc/work/dt/sgile/opt/miniconda3/bin/activate ""
conda activate EveryVoice.sl
{
# Log some information for post-run debugging.
head -n 31231231 $0 # Log this script in case we make modificatons between runs
( set -o posix; set ) >&2 # What was the shell environment like?
conda env export # What was the conda environment like?
#downsize
head -n 12312312 config/*.yaml # What was the configurations that were used?
( cd ~/sam037/git/EveryVoice && git diff main..HEAD; )
} &> $SLURM_JOB_NAME-$SLURM_JOB_ID.bookkeeping
echo "========== PREPROCESSING ==========" >&2
[[ -s preprocessed/filelist.psv ]] ||
srun everyvoice preprocess config/everyvoice-text-to-spec.yaml --cpus $SLURM_CPUS_PER_TASK
echo "========== RESUMING TRAINING FEATURE PREDICTION ==========" >&2
set -o errexit
# --config-args training.attn_bin_loss_weight=0
# NOTE: if the finetune_checkpoint doesn't exist, training starts from scratch.
#for current_epoch in {0..15..3}; do
for current_epoch in {0..5..1}; do
# TODO: record each run into its own log + global log
previous_epoch=$(($current_epoch - 1))
echo "Training up to $current_epoch"
model_ckpt="logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/last.ckpt.$previous_epoch"
sha1sum $model_ckpt || true
srun everyvoice train text-to-spec \
config/everyvoice-text-to-spec.yaml \
--config-args training.finetune_checkpoint="$model_ckpt" \
--config-args training.max_epochs=$((current_epoch+$step_size)) \
--config-args training.logger.version=$version
mv "logs_and_checkpoints/FeaturePredictionExperiment/$version/metrics.csv" \
"logs_and_checkpoints/FeaturePredictionExperiment/$version/metrics.csv.$current_epoch" \
|| true
ls -l logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/*
cp "logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/last.ckpt" \
"logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/last.ckpt.$current_epoch"
sha1sum logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/*
done
Let's make sure we always save at the end of a training epoch by adding save_on_train_epoch_end=True
# This callback will always save the last checkpoint
# regardless of its performance.
last_ckpt_callback = ModelCheckpoint(
save_top_k=1,
save_last=True,
every_n_train_steps=config.training.ckpt_steps,
every_n_epochs=config.training.ckpt_epochs,
enable_version_counter=True,
save_on_train_epoch_end=True,
)
save_last (Union[bool, Literal['link'], None]) – When True, saves a last.ckpt copy whenever a checkpoint file gets saved. Can be set to 'link' on a local filesystem to create a symbolic link. This allows accessing the latest checkpoint in a deterministic manner. Default: None.
save_top_k (int) – if save_top_k == k, the best k models according to the quantity monitored will be saved. If save_top_k == 0, no models are saved. If save_top_k == -1, all models are saved. Please note that the monitors are checked every every_n_epochs epochs. If save_top_k >= 2 and the callback is called multiple times inside an epoch, and the filename remains unchanged, the name of the saved file will be appended with a version count starting with v1 to avoid collisions unless enable_version_counter is set to False. The version counter is unrelated to the top-k ranking of the checkpoint, and we recommend formatting the filename to include the monitored metric to avoid collisions.
save_on_train_epoch_end (Optional[bool]) – Whether to run checkpointing at the end of the training epoch. If this is False, then the check runs at the end of the validation.
every_n_epochs (Optional[int]) – Number of epochs between checkpoints. This value must be None or non-negative. To disable saving top-k checkpoints, set every_n_epochs = 0. This argument does not impact the saving of save_last=True checkpoints. If all of every_n_epochs, every_n_train_steps and train_time_interval are None, we save a checkpoint at the end of every epoch (equivalent to every_n_epochs = 1). If every_n_epochs == None and either every_n_train_steps != None or train_time_interval != None, saving at the end of each epoch is disabled (equivalent to every_n_epochs = 0). This must be mutually exclusive with every_n_train_steps and train_time_interval. Setting both ModelCheckpoint(..., every_n_epochs=V, save_on_train_epoch_end=False) and Trainer(max_epochs=N, check_val_every_n_epoch=M) will only save checkpoints at epochs 0 < E <= N where both values for every_n_epochs and check_val_every_n_epoch evenly divide E.
To recreate: train a text-to-spec model, then add a
finetune_ckpt
value and resume training (you might need to change another model value like the max number of steps). the losses do not resume properly.Maybe related to https://github.com/EveryVoiceTTS/EveryVoice/pull/473
419