Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.
GNU Affero General Public License v3.0

The UI has too many needlessly sharp corners #220

Open mx opened 5 months ago

mx commented 5 months ago

What happened?

There are too many cases where the user interface allows a combination of options that will deterministically cause an exception or error at some point in the training process. The canonical example right now is putting a non-diffusers-format VAE in the VAE path. How many duplicate bugs, and requests for assistance in the Discord, have we seen from that one?

That's not the only one though. Things like using relative step Adafactor without the Adafactor scheduler. Swapping the optimizer choice in the UI while training is going on (though this one is probably more of a race/lifetime issue). FP16/FP32 mismatches. These are all issues the UI can, and should, catch; there's no reason that bad combinations should ever reach the point where they cause an exception.

What did you expect would happen?

Let's make the UI disallow these combinations before they ever reach the backend and get a chance to error.
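
A rough sketch of what such a pre-flight gate could look like. The config field names here (`vae_path`, `optimizer`, `relative_step`, `lr_scheduler`) are hypothetical stand-ins, not OneTrainer's actual internals:

```python
from pathlib import Path


def preflight_errors(config) -> list[str]:
    """Collect fatal config problems before any training starts."""
    errors = []

    # A diffusers-format VAE is a directory containing config.json,
    # not a single .safetensors/.ckpt checkpoint file.
    if config.vae_path and Path(config.vae_path).is_file():
        errors.append(
            "The VAE override must point to a diffusers-format VAE "
            "directory, not a single checkpoint file."
        )

    # Relative-step Adafactor only works with the Adafactor scheduler.
    if (config.optimizer == "ADAFACTOR"
            and config.relative_step
            and config.lr_scheduler != "ADAFACTOR"):
        errors.append("Relative-step Adafactor requires the Adafactor scheduler.")

    return errors


# The UI would run this when "start training" is clicked and refuse
# to launch the backend while the list is non-empty.
```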

Relevant log output

No response

Output of pip freeze

No response

O-J1 commented 5 months ago

Since Nerogar asked, additional common issues (at least when I last checked):

Issues

Allowing users to start training with:

  1. No active concepts
  2. Active concepts that have no images or no captions
  3. Active concepts where the number of captions doesn't match the number of images (usually indicating a copy-paste error)
  4. Corrupt images in the concept. (Our group started caching 4M images and discovered 8 of them were corrupt lol.) We solved this with ImageMagick scripting; a smaller example of this is #219, in the 2nd comment. See the dataset-scan sketch after this list, which covers points 2 through 5.
  5. Masked training is enabled but there are either no masks at all, or some masks are missing/corrupt. (The latter should only require a confirmation, since it might be intentional; allow the user to dismiss this until OT is reopened)
  6. Hitting stop can take forever on larger finetunes because it won't interrupt. (Haven't checked recently as I've been building tools for our finetune)
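
A rough sketch of the kind of dataset scan that could catch points 2 through 5 before caching even starts. Pillow's `Image.verify()` is one cheap way to flag corrupt files (the ImageMagick scripting mentioned above does the same job externally); the directory layout and the `-masklabel.png` naming below are assumptions for illustration, not OneTrainer's actual loader code:

```python
from pathlib import Path

from PIL import Image

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


def scan_concept(concept_dir: str, masked: bool = False) -> list[str]:
    """Return human-readable problems found in one concept directory."""
    root = Path(concept_dir)
    images = [
        p for p in root.iterdir()
        if p.suffix.lower() in IMAGE_EXTS
        and not p.stem.endswith("-masklabel")  # don't count masks as images
    ]
    captions = list(root.glob("*.txt"))
    problems = []

    if not images:
        problems.append("concept has no images")
    if captions and len(captions) != len(images):
        problems.append(
            f"{len(images)} images but {len(captions)} captions "
            "(possible copy-paste error)"
        )

    for img_path in images:
        try:
            with Image.open(img_path) as img:
                img.verify()  # header/CRC check only, no full decode
        except Exception:
            problems.append(f"corrupt image: {img_path.name}")
        # Assumed mask naming convention, for illustration only.
        if masked and not (root / f"{img_path.stem}-masklabel.png").exists():
            problems.append(f"missing mask: {img_path.name}")

    return problems
```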

Subjective common issues:

  1. Allowing LR scaling with Prodigy and other auto-adjusting-LR optimisers, completely frying the model
  2. Allowing users to set repeats when there is only one concept (it's intended for balancing multiple concepts)
  3. Allowing users to train with the rescale noise scheduler on non-2.1 models
  4. Allowing users to set the LR on Prodigy to anything other than 1
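
Checks like these arguably fit better as dismissible warnings than hard blocks, since later comments show valid edge cases for some of them. A sketch of that softer tier, again with hypothetical config field names (`optimizer`, `learning_rate`, `repeats`) rather than OneTrainer's real ones:

```python
def preflight_warnings(config, concept_count: int) -> list[str]:
    """Non-fatal issues the user should confirm before training starts."""
    warnings = []

    # Prodigy adapts its own effective step size; a base LR other
    # than 1 (or extra LR scaling on top) fights that adaptation.
    if config.optimizer == "PRODIGY" and config.learning_rate != 1.0:
        warnings.append("Prodigy normally expects a learning rate of 1.")

    # Repeats exist to balance multiple concepts against each other;
    # with a single concept, more epochs achieve the same thing.
    if concept_count == 1 and config.repeats > 1:
        warnings.append("Repeats with a single concept: consider epochs instead.")

    return warnings
```
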
311-code commented 5 months ago

Wow, yeah, totally agree. I didn't even know I was doing half the stuff in your list wrong. Haha

I also just ran into that corrupted image issue today btw, while caching 113k images. I did not know about the 1 repeat thing for a single concept either.

Allowing users to set repeats when there is only one concept (it's intended for balancing multiple concepts)

I would not want to lose this though, as I sometimes use 20 repeats with no reg, because it takes a lot longer to train on 1 repeat. It still looks good on 20 for me, but apparently I'm doing something wrong then. Maybe just add a note about it.

Any chance anyone has a good SDXL preset for larger datasets, around 500 images, on 24GB VRAM? I can just barely get batch size 2 with EMA working when closing tasks like explorer.exe, but I'm unsure about some of these settings combinations.

Blackhol3 commented 5 months ago

Less critically, but in the same spirit, it would be nice if parameters that are disabled by others were greyed out and/or disabled in the interface. For instance: the EMA parameters when "EMA" is "OFF".
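
OneTrainer's UI is built on customtkinter, where this is mostly a matter of wiring the controlling widget's callback to `configure(state=...)` on its dependents. A standalone toy sketch, not OneTrainer's actual component code:

```python
import customtkinter as ctk

app = ctk.CTk()

# Hypothetical dependent parameter: EMA decay is meaningless while EMA is OFF.
decay_entry = ctk.CTkEntry(app, placeholder_text="EMA decay")


def on_ema_changed(choice: str) -> None:
    # Grey out dependent parameters whenever EMA is OFF.
    decay_entry.configure(state="normal" if choice != "OFF" else "disabled")


ema_menu = ctk.CTkOptionMenu(
    app, values=["OFF", "GPU", "CPU"], command=on_ema_changed
)
ema_menu.pack(padx=10, pady=5)
decay_entry.pack(padx=10, pady=5)

on_ema_changed(ema_menu.get())  # apply the initial state
app.mainloop()
```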

311-code commented 5 months ago

Allowing users to train with the rescale noise scheduler on non-2.1 models

Did not know this, I have been training SDXL with this on. Turning off now.

EMA parameters when "EMA" is "OFF"

Oh wow, is this a thing? My EMA is off and it still shows an update step interval of 1 and an EMA decay value.

^ This one is definitely an issue. lol

Adding to the list: I would also like to see an extra pair of buttons to pause and resume training; that would eliminate the fear factor. I have quickly stopped and resumed training so many times, only to find I forgot to turn off "clear cache", or that it didn't resume from backup and retrained for hours.

It seems it would be easy to just apply those settings automatically when clicking pause and then resume, keeping stop training as it is now for more control.
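
One way to frame it: "pause" is just "stop" plus remembering the handful of settings a clean resume depends on. A hypothetical sketch of that idea (the trainer API and field names here are made up for illustration):

```python
from dataclasses import dataclass


@dataclass
class ResumeState:
    clear_cache: bool
    backup_path: str


def pause(trainer) -> ResumeState:
    """Stop training while recording what a seamless resume needs."""
    state = ResumeState(
        clear_cache=trainer.config.clear_cache,
        backup_path=trainer.save_backup(),  # hypothetical backup call
    )
    trainer.stop()
    return state


def resume(trainer, state: ResumeState) -> None:
    # Force the settings a resume requires, regardless of what the
    # user changed in between.
    trainer.config.clear_cache = False
    trainer.config.continue_from_backup = state.backup_path
    trainer.start()
```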

mx commented 5 months ago

Allowing users to train with the rescale noise scheduler on non-2.1 models

This one, at least, is a valid use case and should be kept available.

311-code commented 5 months ago

This one, at least, is a valid use case and should be kept available.

Yeah, I had assumed at first that the rescale noise scheduler didn't work for SDXL, because it caused sampling errors and won't sample for me with it on during training; but maybe it does work for non-2.1 models after all, because my model turned out totally fine. (I had EMA on also.)

It does seem to work for SDXL, so it's the loss of sampling ability that's confusing people. Here's a thread, and here are comfyanonymous's comments on using it fully in ComfyUI, plus the paper on taking advantage of a model trained with this setting. Use ComfyUI Manager to install 'ComfyUI_experiments', then add the 'RescaleClassifierFreeGuidanceTest' node after the checkpoint and use a k-diffusion scheduler like normal, with karras.

This node seems to create a lot of noise if you didn't turn the setting on in OneTrainer, but looks great if you trained with it on (just tested). I'm also getting better results with the clip skip -3 setting in ComfyUI.

madman404 commented 5 months ago

This one, at least, is a valid use case and should be kept available.

Yeah, I had assumed at first that the rescale noise scheduler didn't work for SDXL, because it caused sampling errors and won't sample for me with it on during training; but maybe it does work for non-2.1 models after all, because my model turned out totally fine. (I had EMA on also.)

It doesn't really have anything to do with that: do not use a v-prediction noise schedule on a model that was not trained to accept it. The option is still valid to expose, however, because there are several SD 1.5 models trained to use v-prediction, e.g. Zerodiffusion.

MysticDaedra commented 5 months ago

Repeats with a single concept is pretty standard practice in LoRA training to effectively increase a small dataset size. If it is a bad practice, I'd love to know why. People have been doing this since Kohya was first a thing.

Nerogar commented 5 months ago

Repeats with a single concept is pretty standard practice in LoRA training to effectively increase a small dataset size. If it is a bad practice, I'd love to know why. People have been doing this since Kohya was first a thing.

It's a bad practice because it's not needed. Training is done in epochs: one epoch is one iteration over your entire dataset, and repeats only increase the number of trained steps for one concept within each epoch. This makes balancing datasets harder.

If you want to train a single concept for more steps, just increase the number of epochs.
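
The arithmetic makes the equivalence concrete: with one concept, repeats and epochs are interchangeable knobs for total step count, and only epochs keep the bookkeeping simple. A quick check:

```python
images, batch_size = 100, 1

# 20 repeats x 5 epochs and 1 repeat x 100 epochs train each image
# the same total number of times:
steps_with_repeats = images * 20 * 5 // batch_size    # 10000
steps_with_epochs = images * 1 * 100 // batch_size    # 10000
assert steps_with_repeats == steps_with_epochs
```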

FurkanGozukara commented 5 months ago

If you want to train a single concept for more steps, just increase the number of epochs.

I hate Kohya's forced repeat logic. OneTrainer is much better in this respect.

O-J1 commented 2 months ago

As per the discussion in Discord, adding UI issues here for centralised tracking:

https://github.com/Nerogar/OneTrainer/issues/377
https://github.com/Nerogar/OneTrainer/issues/34
https://github.com/Nerogar/OneTrainer/issues/112
https://github.com/Nerogar/OneTrainer/issues/115
https://github.com/Nerogar/OneTrainer/issues/141
https://github.com/Nerogar/OneTrainer/issues/142
https://github.com/Nerogar/OneTrainer/issues/211
https://github.com/Nerogar/OneTrainer/issues/258
https://github.com/Nerogar/OneTrainer/issues/275#issuecomment-2134975356
https://github.com/Nerogar/OneTrainer/issues/284
https://github.com/Nerogar/OneTrainer/issues/285#issuecomment-2171682890
https://github.com/Nerogar/OneTrainer/issues/310
https://github.com/Nerogar/OneTrainer/issues/327
https://github.com/Nerogar/OneTrainer/issues/351
https://github.com/Nerogar/OneTrainer/issues/358
https://github.com/Nerogar/OneTrainer/issues/362
https://github.com/Nerogar/OneTrainer/issues/366
https://github.com/Nerogar/OneTrainer/issues/380