Flux LoRA resumes wrong epochs

I have the same issue. As you mention, the ‘first’ resume from a saved state works just fine, but any ‘subsequent’ resumes seem to count only the steps trained since the ‘last’ saved state was loaded, rather than the ‘total’ number of steps since training began at the initial epoch.

This seems to be because kohya incorrectly determines the resumed epoch number from the number of steps recorded in the train_state.json file (steps since the last resume) within the selected saved-state folder, rather than from the ‘actual’ total number of epochs, and then resets the step counter to ‘1’ when continuing training. I don’t know Python well enough to fix it, or indeed whether this is a true reflection of what’s going on ‘under the hood’, but as a manual workaround I have been doing the following:

Go into the saved-state folder I wish to resume from and manually edit current_step in train_state.json, changing it from the incorrectly recorded value (which only counts the steps since the last resume) to the ‘actual’ number of steps trained since epoch zero. This seems to allow kohya to determine the correct epoch from which to resume.

For example, if you had 15 training images and a batch size of 1, then when resuming x-0000025-state from epoch 25 the total is 15 × 25 = 375 steps, so you would manually edit and save the train_state.json from {"current_epoch": 25, "current_step": 60} to {"current_epoch": 25, "current_step": 375}.
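If you get tired of doing that edit by hand, something like the following rough sketch might save some typing. It only automates the arithmetic and the JSON edit from the example above. The function names are my own (hypothetical, not part of kohya), and it assumes 1 repeat per image and no gradient accumulation, so adjust the calculation if your setup differs. The only things taken from the actual saved state are the train_state.json filename and its current_epoch / current_step keys.

```python
import json
import math
from pathlib import Path


def expected_total_steps(num_images: int, batch_size: int, epoch: int) -> int:
    """Total steps since epoch zero (hypothetical helper).

    Assumption: 1 repeat per image and no gradient accumulation.
    """
    # Steps per epoch, rounded up to cover a final partial batch.
    steps_per_epoch = math.ceil(num_images / batch_size)
    return steps_per_epoch * epoch


def fix_train_state(state_dir: str, num_images: int, batch_size: int) -> dict:
    """Rewrite current_step in train_state.json inside a saved-state folder."""
    path = Path(state_dir) / "train_state.json"
    state = json.loads(path.read_text())

    # Recompute the step count from the epoch already stored in the file.
    state["current_step"] = expected_total_steps(
        num_images, batch_size, state["current_epoch"]
    )

    path.write_text(json.dumps(state))
    return state
```

With the numbers from the example, calling fix_train_state on the x-0000025-state folder with num_images=15 and batch_size=1 would change current_step from 60 to 375. I would keep a backup copy of the original train_state.json before running anything like this.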

This ‘seems’ to then let training recommence from the correct epoch, but the step counter will once more start counting from 1, so this calculation and edit will need to be performed each time a saved state is resumed. I’m not totally certain whether this then continues the training from the correct place, as I am still getting to grips with kohya and Flux, but it ‘seems’ to be continuing rather than starting afresh. Hope this helps in some small way...

bmaltais / kohya_ss

Flux LoRA resumes wrong epochs #2771