More clarity on resuming disconnected training sessions

chromatic-descension commented 1 year ago

The Colab could use a few tweaks to improve resuming a training session that has been disconnected.

For context, I'm training a model with a dozen subjects at full precision (~200 images). That takes about 20 hours, so I expect Colab Pro to kick me off periodically.

To get it started again, I had to:

1. Load the old session by name
2. Go to Google Drive and rename the latest CKPT file from FooBar_step_####.ckpt to FooBar.ckpt
3. Check the resume training box
4. Set the total training steps to the original total MINUS however many have been completed
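
The manual steps above could be scripted roughly as follows. This is only a sketch and not part of the Colab. The function names (`latest_step_checkpoint`, `prepare_resume`) are hypothetical, and it assumes the intermediate checkpoints sit in one session folder and are named `<session>_step_<N>.ckpt`, as in the FooBar example. It copies rather than renames, so the intermediate file is kept.

```python
import re
import shutil
from pathlib import Path


def latest_step_checkpoint(session_dir, session_name):
    """Return (step, path) for the highest-numbered <name>_step_<N>.ckpt, or None."""
    pattern = re.compile(rf"^{re.escape(session_name)}_step_(\d+)\.ckpt$")
    found = []
    for path in Path(session_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0]) if found else None


def prepare_resume(session_dir, session_name, original_total_steps):
    """Copy the newest intermediate checkpoint to <name>.ckpt.

    Returns the number of steps still to train (original total minus
    the step count encoded in the checkpoint's filename).
    """
    latest = latest_step_checkpoint(session_dir, session_name)
    if latest is None:
        raise FileNotFoundError("no intermediate checkpoint found")
    step, path = latest
    shutil.copy2(path, Path(session_dir) / f"{session_name}.ckpt")
    return original_total_steps - step
```

The returned value is what would go into the total training steps field when resuming.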

The new value to set for Train_text_encoder_for is not obvious. Please correct me if I'm wrong, but this is how I understand it.

For example, let's say you initially start training as follows:

Training_Steps = 10,000
Train_text_encoder_for = 70% (7000 steps)
Colab disconnected after 4000 steps

Train_text_encoder_for works by computing a step number at which to stop training the text encoder. The original run planned to stop training the text encoder at step 7000, but it never made it that far, so the text encoder was trained for all 4000 completed steps. When we restart training, we have 6000 total steps remaining, and we want the text encoder to train for 3000 of them, since it already completed 4000 of the planned 7000. Therefore we have to set Train_text_encoder_for to 50% (3000 / 6000).
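
The same arithmetic as a small helper, in case it is useful. This is a sketch of my understanding above, not code from the Colab; the function name and parameters are made up for illustration, and it assumes the text encoder trains from step 0 up to the computed cutoff step.

```python
def resume_text_encoder_percent(original_total, original_percent, completed_steps):
    """Percentage to enter for Train_text_encoder_for when resuming.

    Assumes the text encoder was trained on every completed step up to
    the originally planned cutoff (original_total * original_percent / 100).
    """
    remaining_total = original_total - completed_steps
    if remaining_total <= 0:
        raise ValueError("no training steps remaining")
    # Step at which the original run planned to stop the text encoder.
    planned_stop = original_total * original_percent / 100
    # Text encoder steps still owed; zero if the cutoff was already passed.
    remaining_text_steps = max(0.0, planned_stop - completed_steps)
    return 100 * remaining_text_steps / remaining_total


print(resume_text_encoder_percent(10_000, 70, 4_000))  # 50.0
```

If the disconnect happens after the planned cutoff (for example at step 8000), the helper returns 0, meaning no further text encoder training.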

In any case, this is pretty unintuitive. It would be nice if some extra details were written with the CKPT file so that it could smoothly start back up without this hassle. Though maybe folks are typically training smaller models and merging them? I'm not sure what works best here so please LMK!

TheLastBen commented 1 year ago

Concerning the intermediary checkpoints, I will soon add a menu for choosing which intermediary checkpoint (if any are found) to use when there is no final model in the session's folder.

As for the text encoder, keep it below 2000 steps for better results, and do 1000 steps at the beginning and 1000 steps at the end. Training the text_encoder for too many steps will lead to overfitting and less diverse results.

chromatic-descension commented 1 year ago

Great, that sounds good, thanks!

Regarding the text encoder, do you mean 2000 total or per subject? I have 196 pictures of 11 subjects (individual people + pets) that I'm training all at once, so my total is 196 * 200 = 39,200 training steps. I had the text encoder set to 70%, which amounts to 27,440 steps, or roughly 2,500 steps per subject.
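
For reference, the numbers above worked out in a few lines. This is just the arithmetic from this comment; the variable names are made up, and the 200 steps per image is simply the multiplier I used.

```python
images = 196
steps_per_image = 200        # multiplier used in this comment
subjects = 11
text_encoder_percent = 70    # Train_text_encoder_for setting

total_steps = images * steps_per_image
text_encoder_steps = total_steps * text_encoder_percent // 100
per_subject = text_encoder_steps / subjects

print(total_steps)           # 39200
print(text_encoder_steps)    # 27440
print(round(per_subject))    # 2495
```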

TheLastBen commented 1 year ago

Set it to 2000 total, and if you don't get accurate faces, resume for another 500, and so on. You have to be careful with the text encoder to avoid overfitting. If you train people of the same gender, you might get merged facial features between them.
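
Since Train_text_encoder_for was described earlier in the thread as a percentage of the total steps, a step budget like 2000 has to be converted. A minimal sketch, assuming the setting is a plain percentage of Training_Steps; the helper name is hypothetical and the 39,200 total is the figure from the question above.

```python
def steps_to_percent(text_encoder_steps, total_steps):
    """Convert a text encoder step budget into a percentage of total steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return 100 * text_encoder_steps / total_steps


# 2000 text encoder steps out of 39,200 total training steps.
print(round(steps_to_percent(2000, 39_200), 2))  # 5.1
```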

nawnie commented 1 year ago

Not sure you will need that long; I've been getting great results with sets of 1k images. Something that seems to help when training a lot of subjects is to use the train batch size option and enable the "use less VRAM" option. As I understand it, a larger batch size reduces how often an edit is made to the original model, i.e. at a batch size of 4 you gather the info from four images before writing.
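
A rough illustration of that idea, assuming each training step processes one batch and applies one weight update (the usual meaning of batch size, but I have not checked how the Colab counts steps). The function name is made up.

```python
import math


def updates_per_pass(num_images, batch_size):
    """Weight updates needed to see every image once at a given batch size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(num_images / batch_size)


# With 1000 images: 1000 updates at batch size 1, 250 at 4, 125 at 8.
for batch_size in (1, 4, 8):
    print(batch_size, updates_per_pass(1000, batch_size))
```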

If you're willing, you could run a batch size of around 22 on the premium GPU; however, that doesn't seem to yield the greatest results. The best I've had is with a batch size of around 8.

** I do all my text encoder training at a batch size of 1. No real reason, I just haven't played with that yet.

nawnie commented 1 year ago

> Concerning the intermediary checkpoints, I will soon add a menu from which to choose which intermediary checkpoint (if found) in case there is no final model in the session's folder.
>
> As for the text encoder, keep it below 2000 steps for better results, and do 1000 steps at the beginning, a 1000 steps at the end. Training the text_encoder for too many steps will lead to overfitting and less divers results.

Can't wait for this; the Colab file explorer likes to die on me :D

TheLastBen commented 1 year ago

> cant wait for this, colab file explorer likes to die on me :D

https://github.com/TheLastBen/fast-stable-diffusion/commit/364bf52c80473d08640c557be0e5c8daa5661f58

slurpey commented 1 year ago

Hi, I'm also a bit confused. In the example below, I would simply like to conclude the training and test the model. What should I do?

TheLastBen commented 1 year ago

Choose the number of the model that you want to use. I'll add clearer instructions.

TheLastBen / fast-stable-diffusion

More clarity on resuming disconnected training sessions #597