AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Feature Request]: Automatic saving of best loss model when training #3389

Open R-N opened 2 years ago

R-N commented 2 years ago

Is there an existing issue for this?

What would your feature do?

When you train an embedding or hypernetwork, the webui is designed to save every x steps. But the lowest-loss model might be lost between those saves. Also, simply saving the lowest-loss model tends to overfit.

Therefore, I want the webui to support saving a history/log of the best models: whenever training produces a better loss, the webui would generate a preview image and save it along with the model (just like the current save mechanism).

I want to be able to specify a best-loss threshold, where 0 means even the smallest reduction from the best loss triggers a save. But a fixed threshold might not work well as training goes on and the differences in loss get much smaller. So maybe the threshold could be progressive(?) like the learning rate, or it could be a percentage of the best loss.

It might be nice to be able to enable or disable the current saving mechanism and the best-loss saving mechanism as separate toggles, so it's possible to have both of them on. The option to specify a separate log directory would also be nice. Otherwise I might need to turn off the current mechanism just to keep the two from getting mixed up.
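To make the threshold idea concrete, here is a rough sketch of the check I'm imagining (every name in it is made up for illustration, not actual webui code):

```python
# Rough illustrative sketch only; every name here is hypothetical,
# not actual webui code.
best_loss = None

def save_checkpoint_and_preview(step, tag):
    # placeholder for the existing save-model + generate-preview logic
    print(f"step {step}: saving '{tag}' model and preview image")

def maybe_save_best(step, loss, threshold=0.0, relative=False):
    """Save whenever the loss improves on the best seen so far.

    threshold=0.0 saves on any improvement; with relative=True the
    threshold is treated as a fraction of the current best loss, so the
    required improvement shrinks as the loss itself gets smaller.
    """
    global best_loss
    if best_loss is None:
        improved = True
    else:
        required_drop = best_loss * threshold if relative else threshold
        improved = loss < best_loss - required_drop
    if improved:
        best_loss = loss
        save_checkpoint_and_preview(step, tag="best")
```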

Proposed workflow

  1. Go to training tab
  2. Select train tab
  3. Tick "log best models" (there might be a better label)
  4. Specify loss threshold
  5. Maybe specify a different log directory
  6. Press train
  7. The webui will log training every time it produces a better loss (respecting the threshold)

Additional information

I might be able to work on this. I haven't understood the project structure yet, though, nor am I familiar with gradio.

Parameters are from this file. Training is done in this file.

CodeExplode commented 2 years ago

The way that loss is calculated: it slightly adds noise to an image, then asks the AI to guess which parts are noise and should be removed (which is how stable diffusion works over multiple steps), and the closer its prediction is to the added noise, the lower the loss. It can fluctuate pretty wildly depending on the image. The most recent 32 losses are averaged to give the current loss.
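Very roughly, a simplified sketch of what one training step and the reported value look like (illustrative pseudo-PyTorch, not the actual webui code; add_noise is just a stand-in for the real noise scheduler):

```python
import torch
import torch.nn.functional as F
from collections import deque

def add_noise(latent, noise, strength=0.5):
    # stand-in for the real scheduler, which mixes signal and noise per timestep
    return (1 - strength) * latent + strength * noise

def training_step(model, latent, timestep):
    # Add random noise to the image latent, ask the model to predict that
    # noise, and score it by how close the prediction is to the real noise.
    noise = torch.randn_like(latent)      # sampled fresh every step
    noisy_latent = add_noise(latent, noise)
    predicted_noise = model(noisy_latent, timestep)
    return F.mse_loss(predicted_noise, noise)

# The "current loss" shown in the UI is a moving average of the most
# recent per-step losses (e.g. the last 32), which smooths the
# image-to-image fluctuation a bit.
recent_losses = deque(maxlen=32)

def report_loss(loss_value):
    recent_losses.append(loss_value)
    return sum(recent_losses) / len(recent_losses)
```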

(There's probably also general noise and imperfections which the AI should be removing in a given sample image, so it seems unlikely that you'd ever get perfect loss which exactly identifies the randomly added noise parts)

AFAIK the training process will try to nudge the embedding weights more the larger the loss is, and leave them more stable when there's less loss, so that hopefully they'll eventually jitter into place and settle on a solution which satisfies all the training images the best.

From what I've seen, loss tends to fluctuate around the same value for a long time even as results massively change, so I'm not 100% sure it's the best way of telling whether training is getting better. You'd probably want at least the average loss over your entire training data set, rather than just the most recent, or the last 32.
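Something along these lines, as a rough sketch (hypothetical helpers, not webui code); loss_fn would be a per-sample loss like the training step above, and fixing the seed means two checkpoints are scored on the same noise rather than on whichever images happened to come last:

```python
import torch

def full_dataset_loss(model, dataset, loss_fn, seed=0):
    # Average the loss over every sample in the training set with the
    # random noise/timesteps fixed by a seed, so different checkpoints
    # are compared on exactly the same inputs.
    torch.manual_seed(seed)
    total = 0.0
    with torch.no_grad():
        for latent, timestep in dataset:
            total += loss_fn(model, latent, timestep).item()
    return total / len(dataset)
```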

R-N commented 2 years ago

The way that loss is calculated: it slightly adds noise to an image, then asks the AI to guess which parts are noise and should be removed (which is how stable diffusion works over multiple steps), and the closer its prediction is to the added noise, the lower the loss. It can fluctuate pretty wildly depending on the image.

So the noise added is mostly artificial then, and thus the loss too. Is the added noise controlled to always have the same total effect on the loss, or is it random?

The most recent 32 losses are averaged to give the current loss.

So I misunderstood the current loss then. What would knowing the most recent 32 losses be useful for?

From what I've seen, loss tends to fluctuate around the same value for a long time even as results massively change, so I'm not 100% sure it's the best way of telling whether training is getting better. You'd probably want at least the average loss over your entire training data set, rather than just the most recent, or the last 32.

Do you have a suggestion for a better evaluation metric?

R-N commented 2 years ago

Wait, the model and image preview are logged every x steps instead of every epoch. So they're logged and saved in the middle of an epoch?

It's understandable for training the main model with millions of images. But for a hypernetwork or embedding with a rather small dataset and quick epochs...

R-N commented 2 years ago

I just realized that the loss logging interval for the CSV is set in the settings. It's 100 for me; it might be 32 for you. If I set this and the model log interval to the size of the dataset, I should be getting them once every epoch.

Edit: Yeah it works
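For concreteness, the arithmetic I'm relying on (numbers are just an example, and I'm assuming batch size 1):

```python
# Example only, assuming batch size 1: with the logging/saving interval
# set to the dataset size, each log lands on an epoch boundary.
dataset_size = 100
log_every = dataset_size  # CSV loss interval and model/preview interval

for step in range(1, 301):
    if step % log_every == 0:
        print(f"step {step} = end of epoch {step // dataset_size}")
```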

R-N commented 2 years ago

Wait, no. It's true that the loss is now logged every epoch, but it's off by 1 step. It seems to log at the very first step, then it logs every specified number of steps after that. Why?

For example, if I set it to log every 100 steps, it will log steps 1, 101, 201, ...
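A guess at how that pattern could arise (I haven't checked the actual code, so this is purely illustrative): if the interval check runs on a 0-based step counter but the step is reported 1-based, you get exactly 1, 101, 201, ...:

```python
# Hypothetical illustration, not the actual webui code.
log_every = 100
for step in range(300):                    # 0-based internal counter
    if step % log_every == 0:              # fires at 0, 100, 200, ...
        print(f"logging step {step + 1}")  # reported 1-based: 1, 101, 201
```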