bghira / SimpleTuner

A general fine-tuning kit geared toward diffusion models.
GNU Affero General Public License v3.0

Training doesn't resume from previous checkpoint using max_train_steps #1172

Open playerzer0x opened 3 days ago

playerzer0x commented 3 days ago
  1. I train a model to 10k steps
  2. I change max_train_steps in config to 15000
  3. I swap out the dataloader config for an updated multidatabackend
  4. I start training and receive this error: `2024-11-21 01:52:34,920 [INFO] Reached the end (58 epochs) of our training run (42 epochs). This run will do zero steps.`
  5. Training doesn't continue

If I set max_train_steps to 0 and change num_train_epochs to 100, training resumes fine. I haven't counted, but the updated dataset used for the resume may be smaller than the original one.
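For concreteness, the two variants look roughly like this (a sketch assuming the JSON-style config whose keys mirror the CLI flags; the resume and dataloader keys are shown only for context). The step-based config that errors out:

```json
{
  "--resume_from_checkpoint": "latest",
  "--data_backend_config": "config/multidatabackend.json",
  "--max_train_steps": 15000,
  "--num_train_epochs": 0
}
```

versus the epoch-based workaround that starts fine:

```json
{
  "--resume_from_checkpoint": "latest",
  "--data_backend_config": "config/multidatabackend.json",
  "--max_train_steps": 0,
  "--num_train_epochs": 100
}
```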

My brain thinks in steps, so I'd prefer to use steps over epochs.

bghira commented 3 days ago

well, that is normal. you are no longer resuming the old training run, as you have changed everything.

it's not really recommended to change anything within a single training run, let alone the entire dataset or the step schedule
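To make the "zero steps" message less mysterious: the trainer converts the step budget into an epoch budget using the current dataset's steps-per-epoch, so swapping the dataloader changes what both numbers mean. A rough sketch of that arithmetic (hypothetical values picked to land on the figures in the log above; the variable names are not SimpleTuner's):

```python
import math

# Old run: 10k optimizer steps over a dataset yielding ~172 steps per epoch
# leaves the checkpoint's epoch counter at 58. (Hypothetical numbers.)
old_steps_per_epoch = 172
resumed_epoch = 10_000 // old_steps_per_epoch            # 58

# New run: the swapped dataloader changes steps-per-epoch, so the 15k-step
# budget converts to a different epoch budget.
new_steps_per_epoch = 358
epoch_budget = math.ceil(15_000 / new_steps_per_epoch)   # 42

# The resume logic compares the stored counter against the new budget and
# concludes there is nothing left to do.
if resumed_epoch >= epoch_budget:
    print(
        f"Reached the end ({resumed_epoch} epochs) of our training run "
        f"({epoch_budget} epochs). This run will do zero steps."
    )
```

Nothing "lost" the 10k steps; the checkpoint's epoch counter simply lands past the end of the newly computed schedule.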

playerzer0x commented 2 days ago

This change would be across two separate training runs. I'm following Caith's recommendation on training new subjects into a "base LoKR" that was previously trained on styles.

bghira commented 2 days ago

you want to use --init_lora to begin a new training run from the old lokr then. it takes a path to the safetensor file
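Concretely, that might look like the following (a sketch: the flag comes from the comment above, while the path and the paired output directory are placeholders, on the assumption that a clean run wants a fresh output location rather than resume state):

```json
{
  "--init_lora": "/path/to/base_lokr.safetensors",
  "--output_dir": "output/new-subject-run"
}
```

Since this begins a new run, the old `--resume_from_checkpoint` state no longer applies.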

playerzer0x commented 21 hours ago

> you want to use --init_lora to begin a new training run from the old lokr then. it takes a path to the safetensor file

Tried this, but the trainer threw a tensor size error on the first step. Went back to using epochs, and training starts fine.
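One generic way to dig into a tensor size error like that (an assumption about debugging, not a documented SimpleTuner workflow) is to list the shapes stored in the old LoKR file and compare them against the new run's network settings, since a changed rank/algo/target-module configuration will build tensors that no longer match the file:

```python
from safetensors import safe_open

# Hypothetical debugging aid: print every key and tensor shape in the old
# LoKR safetensors file. If the new run constructs the network with
# different dimensions, the mismatching keys should stand out here.
with safe_open("base_lokr.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        print(key, tuple(f.get_slice(key).get_shape()))
```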