Closed peteer01 closed 4 months ago
You use this option to save the state of the training, and then when you want to resume, you enter the folder where the state was saved here.
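For reference, a minimal sketch of what that looks like on the command line, assuming the trainer is based on kohya-ss sd-scripts (the flag names and the state folder name may differ in this UI, and the paths here are placeholders):

```shell
# Assumption: the trainer wraps kohya-ss sd-scripts; adjust flags/paths for your setup.

# Initial run: --save_state writes the full training state
# (weights, optimizer, LR scheduler) alongside the checkpoints.
accelerate launch train_network.py \
  --output_dir /path/to/output \
  --save_state

# Resuming: --resume takes the folder where that state was saved
# (a "*-state" directory inside the output folder).
accelerate launch train_network.py \
  --output_dir /path/to/output \
  --resume /path/to/output/your-saved-state-folder
```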
That makes sense! Thank you for sharing that. It might help others if that can be added to the readme.
That depends on him, but I guess he will once the trainer gets an update. Also np, happy to help :)
Also note that saving the state saves everything: the checkpoint used, the optimizer state, the current learning rate, etc. So each state takes ~6.5GB per epoch on XL training, so it's best to keep only the last state you're able to train.
Solved
Apologies if this is not the right place to ask for this, but I am unable to figure out how to resume training from a saved epoch checkpoint. Is it possible to update the readme documentation to explain how to use LoRA Trainer to continue a previous training run?