Open joneschunghk opened 2 months ago
I also have this same issue. As you mention, the ‘first’ resume from a saved state works just fine, but any ‘subsequent’ resumes seem to count only the steps trained since the ‘last’ saved state was loaded, rather than the ‘total’ number of steps since training began at the initial epoch.
This seems to be because kohya incorrectly determines the resumed epoch number from the number of steps recorded in the train_state.json file (since the last resume) within the selected saved-state folder, rather than from the ‘actual’ number of total epochs, and then resets the step counter to ‘1’ when continuing training. I don’t know Python well enough to fix it, or indeed whether this is a true reflection of what’s going on ‘under the hood’, but as a manual workaround I have been doing the following:
Go into the saved-state folder I wish to resume from and manually edit the current step number in train_state.json, changing it from the incorrectly recorded number of total steps trained to the ‘actual’ number of steps trained since epoch zero. This seems to allow kohya to determine the correct epoch from which to resume.
If you had 15 training images and a batch size of 1, then to resume x-0000025-state from epoch 25 you would manually edit and save train_state.json, changing it from {"current_epoch": 25, "current_step": 60} to {"current_epoch": 25, "current_step": 375} (15 steps per epoch × 25 epochs).
This ‘seems’ to then let training recommence from the correct epoch, but the step counter will once more start counting from 1, so this calculation and edit will need to be performed each time a saved-state is resumed. I’m not totally certain if this subsequently continues the training from the correct place, as I am still getting to grips with kohya and flux, but it ‘seems’ to be continuing rather than starting afresh? Hope this helps in some small way....
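The manual edit described above can be sketched as a small script. This is only a rough illustration, not part of kohya: the helper names `corrected_step` and `patch_train_state` are made up here, and it assumes steps per epoch equals images × repeats ÷ batch size (rounded up) with no gradient accumulation. It keeps a backup copy of the original file.

```python
import json
import math
from pathlib import Path


def corrected_step(num_images, epochs_done, batch_size=1, repeats=1):
    """Total steps since epoch zero (assumes no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs_done


def patch_train_state(state_dir, num_images, batch_size=1, repeats=1):
    """Rewrite current_step in <state_dir>/train_state.json (hypothetical helper)."""
    path = Path(state_dir) / "train_state.json"
    state = json.loads(path.read_text())
    # Keep the original next to the edited file in case the edit needs undoing.
    path.with_name("train_state.json.bak").write_text(json.dumps(state))
    state["current_step"] = corrected_step(
        num_images, state["current_epoch"], batch_size, repeats
    )
    path.write_text(json.dumps(state))
    return state
```

For the example above, `patch_train_state("x-0000025-state", num_images=15)` would change current_step from 60 to 375. As with the manual edit, it would need to be run again before each resume.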
I am training a LoRA with 15 images. I saved a state on epoch 4. The train_state.json file inside x-0000004-state is {"current_epoch": 4, "current_step": 60}
Then I resumed x-0000004-state on epoch 5, and saved a state on epoch 25. The train_state.json file inside x-0000025-state is {"current_epoch": 25, "current_step": 60}
Then I resumed x-0000025-state on epoch 22, and saved a state on epoch 50. The train_state.json file inside x-0000050-state is {"current_epoch": 50, "current_step": 435}
Then I resumed x-0000050-state on epoch 30. I will keep repeating the reverse training
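Some of the numbers in this report fit the idea that the resumed epoch is derived from current_step alone. A minimal sketch, assuming 15 steps per epoch (15 images, batch size 1, one repeat) and assuming the epoch is computed as floor(current_step / steps per epoch) + 1. The function `resumed_epoch` is hypothetical and is not kohya's actual code.

```python
def resumed_epoch(current_step, steps_per_epoch):
    # Hypothetical rule: the epoch is inferred only from the recorded step count.
    return current_step // steps_per_epoch + 1


STEPS_PER_EPOCH = 15  # assumption: 15 images, batch size 1, one repeat

for name, step in [("x-0000004-state", 60), ("x-0000050-state", 435)]:
    print(name, "->", resumed_epoch(step, STEPS_PER_EPOCH))
# x-0000004-state -> 5
# x-0000050-state -> 30
```

Both results match the epochs reported above for those two states. The value posted for x-0000025-state (60) would give epoch 5 under this assumption rather than the observed 22, so that entry does not fit the pattern as posted.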