bmaltais / kohya_ss

Flux LoRA resumes wrong epochs #2771

Open joneschunghk opened 2 months ago

joneschunghk commented 2 months ago

I am training a LoRA with 15 images. I saved a state at epoch 4. The train_state.json file inside x-0000004-state contains {"current_epoch": 4, "current_step": 60}.

I then resumed from x-0000004-state. Training continued at epoch 5, and I saved a state at epoch 25. The train_state.json file inside x-0000025-state contains {"current_epoch": 25, "current_step": 60}.

I then resumed from x-0000025-state, but training continued at epoch 22. I saved a state at epoch 50. The train_state.json file inside x-0000050-state contains {"current_epoch": 50, "current_step": 435}.

I then resumed from x-0000050-state, and training continued at epoch 30. Every resume after the first goes back to an earlier epoch than the one I saved, so I keep retraining epochs that were already completed.

mayjay10 commented 2 months ago

I have the same issue. As you mention, the first resume from a saved state works fine, but any subsequent resume seems to count only the steps trained since the last saved state was loaded, rather than the total number of steps since training began at the initial epoch.

This seems to happen because kohya derives the epoch to resume from using the step count recorded in train_state.json inside the selected saved-state folder, which only covers the steps since the last resume, rather than using the actual total number of epochs. It then resets the step counter to 1 when training continues. I don't know Python well enough to fix it, or to say whether this is a true reflection of what is going on under the hood, but as a manual workaround I have been doing the following:

Go into the saved-state folder I wish to resume from and manually edit current_step in train_state.json, changing it from the incorrectly recorded step count to the actual number of steps trained since epoch zero. This seems to allow kohya to determine the correct epoch from which to resume.

For example, with 15 training images and a batch size of 1, each epoch is 15 steps, so 25 epochs come to 25 × 15 = 375 steps. To resume x-0000025-state from epoch 25, you would edit train_state.json from {"current_epoch": 25, "current_step": 60} to {"current_epoch": 25, "current_step": 375} and save it.
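The same calculation and edit can be scripted. The following is only a rough sketch of the manual workaround, not anything taken from kohya itself: the function names and the `repeats` parameter are made up for illustration, and it assumes one optimizer step per batch (no gradient accumulation) and that train_state.json holds only the two keys shown in this thread.

```python
import json
import math
import shutil
from pathlib import Path


def steps_per_epoch(num_images: int, batch_size: int = 1, repeats: int = 1) -> int:
    """Steps in one epoch, ASSUMING one step per batch and no gradient accumulation."""
    return math.ceil(num_images * repeats / batch_size)


def patch_train_state(state_dir: str, num_images: int,
                      batch_size: int = 1, repeats: int = 1) -> dict:
    """Set current_step to current_epoch * steps_per_epoch (hypothetical helper)."""
    path = Path(state_dir) / "train_state.json"
    # Keep a copy of the original file next to it before touching anything.
    shutil.copy2(path, path.with_name("train_state.json.bak"))

    state = json.loads(path.read_text())
    state["current_step"] = state["current_epoch"] * steps_per_epoch(
        num_images, batch_size, repeats
    )
    path.write_text(json.dumps(state))
    return state
```

Calling `patch_train_state("x-0000025-state", num_images=15)` on the state from this thread would rewrite `current_step` from 60 to 375. If your dataset uses repeats, a larger batch size, or gradient accumulation, the steps-per-epoch figure will differ, so check it against the step count shown in your own training log first.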

This seems to let training recommence from the correct epoch, but the step counter once more starts counting from 1, so the calculation and edit need to be repeated each time a saved state is resumed. I'm not totally certain that this continues the training from the correct place, as I am still getting to grips with kohya and Flux, but it seems to be continuing rather than starting afresh. Hope this helps in some small way.