Linaqruf / kohya-trainer

Adapted from https://note.com/kohya_ss/n/nbf7ce8d80f29 for easier cloning
Apache License 2.0

kohya-LoRA-trainer-XL.ipynb: LoRA degrades during training but does not learn #291

Closed: drimeF0 closed this issue 10 months ago

drimeF0 commented 10 months ago

I am using the repository from qaneel.

[three images attached]

config:

[sdxl_arguments]
cache_text_encoder_outputs = false
no_half_vae = true
min_timestep = 0
max_timestep = 1000
shuffle_caption = true
lowram = true

[model_arguments]
pretrained_model_name_or_path = "gsdf/CounterfeitXL"
vae = "/content/vae/sdxl_vae.safetensors"

[dataset_arguments]
debug_dataset = false
in_json = "/content/LoRA/meta_lat.json"
train_data_dir = "/content/LoRA/train_data"
dataset_repeats = 1
keep_tokens = 0
resolution = "1024,1024"
color_aug = false
token_warmup_min = 1
token_warmup_step = 0

[training_arguments]
output_dir = "/content/LoRA/output/sdxl_lora"
output_name = "sdxl_lora"
save_precision = "fp16"
save_every_n_epochs = 1
train_batch_size = 4
max_token_length = 225
mem_eff_attn = false
sdpa = true
xformers = false
max_train_epochs = 10
max_data_loader_n_workers = 8
persistent_data_loader_workers = true
gradient_checkpointing = true
gradient_accumulation_steps = 1
mixed_precision = "fp16"

[logging_arguments]
log_with = "tensorboard"
logging_dir = "/content/LoRA/logs"
log_prefix = "sdxl_lora"

[sample_prompt_arguments]
sample_every_n_epochs = 1
sample_sampler = "euler_a"

[saving_arguments]
save_model_as = "safetensors"

[optimizer_arguments]
optimizer_type = "AdaFactor"
learning_rate = 0.0001
max_grad_norm = 0
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False",]
lr_scheduler = "constant_with_warmup"
lr_warmup_steps = 100

[additional_network_arguments]
no_metadata = false
network_module = "networks.lora"
network_dim = 32
network_alpha = 16
network_args = []
network_train_unet_only = true
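
For reference, the XL notebook feeds a TOML config like this to the SDXL LoRA training script. A minimal sketch of the launch command, assuming the standard sd-scripts entry point; the accelerate options and file paths are illustrative, not taken from this issue:

# launch SDXL LoRA training with the config above (paths are placeholders)
accelerate launch --num_cpu_threads_per_process=1 sdxl_train_network.py \
  --sample_prompts="/content/LoRA/config/sample_prompt.txt" \
  --config_file="/content/LoRA/config/config_file.toml"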

The problem is that with each epoch the quality of the generations degrades dramatically, and by about epoch 3-4 the output turns into a terrible mess.

Here is an example from the first training epoch: [image]

And here is an example from the 2nd epoch: [image]

And finally, the 8th epoch: [image]

What am I doing wrong? Maybe I need to increase the number of epochs?

drimeF0 commented 10 months ago

Today I tested with the main kohya_ss code and the results are the same, so I'm closing the issue here.

Linaqruf commented 10 months ago

The problem is probably the dataset; also, don't rely on the training samples.

drimeF0 commented 10 months ago

The problem is probably the dataset; also, don't rely on the training samples.

I've tried it 10 times already with different datasets and it always ends up with a terrible result. Today I'll try manually selecting images for training and see what happens.

Linaqruf commented 10 months ago

Or use a lower learning rate.
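
In the config above, that corresponds to editing the [optimizer_arguments] block. A sketch using the 2e-5 value tried later in this thread; everything else is unchanged:

[optimizer_arguments]
optimizer_type = "AdaFactor"
learning_rate = 0.00002   # lowered from 0.0001
max_grad_norm = 0
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False",]
lr_scheduler = "constant_with_warmup"
lr_warmup_steps = 100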

drimeF0 commented 10 months ago

The problem is probably the dataset; also, don't rely on the training samples.

[image] 100 training steps, 48 pictures from safebooru with the tag "one_eye_closed". I already used my notebook, and apparently I made everything even worse than before. I also noticed this: [screenshot] Could such terrible image quality be the problem?

drimeF0 commented 10 months ago

Or use a lower learning rate.

With a learning rate of 0.00002, the quality of the generated images seems to deteriorate much more slowly. I'll see what happens after 1000 training steps.

drimeF0 commented 10 months ago

Or use a lower learning rate.

With a learning rate of 0.00002, the quality of the generated images seems to deteriorate much more slowly. I'll see what happens after 1000 training steps.

The image style has changed, but the LoRA has still not learned the concept behind the one_eye_closed tag. [image]

Linaqruf commented 10 months ago

The problem is probably the dataset; also, don't rely on the training samples. [image] 100 training steps, 48 pictures from safebooru with the tag "one_eye_closed". I already used my notebook, and apparently I made everything even worse than before. I also noticed this: [screenshot] Could such terrible image quality be the problem?

Yes, 512x512 is unusable for SDXL.

drimeF0 commented 10 months ago

[image]

Or use a lower learning rate.

I switched from the regular LoRA to lora_fa from https://github.com/bmaltais/kohya_ss and it worked right away, after just 50 training steps. I also set the learning rate to 0.00005.
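
For anyone reproducing this, the switch amounts to swapping the network module in the config and lowering the learning rate. A sketch, assuming the LoRA-FA implementation is exposed as networks.lora_fa, as in recent kohya_ss/sd-scripts:

[additional_network_arguments]
no_metadata = false
network_module = "networks.lora_fa"   # was "networks.lora"
network_dim = 32
network_alpha = 16
network_args = []
network_train_unet_only = true

[optimizer_arguments]
learning_rate = 0.00005   # 5e-5, as mentioned above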

Linaqruf commented 10 months ago

0.0002 (2e-4) is higher than 1e-4; maybe try 1e-5 or 5e-5.

drimeF0 commented 10 months ago

The problem is probably the dataset; also, don't rely on the training samples. [image] 100 training steps, 48 pictures from safebooru with the tag "one_eye_closed". I already used my notebook, and apparently I made everything even worse than before. I also noticed this: [screenshot] Could such terrible image quality be the problem?

Yes, 512x512 is unusable for SDXL.

How can this be fixed? By resizing the images in the dataset, or by adding an argument to the command? Edit: I found a way: just add the --max_resolution 1024,1024 argument to prepare_buckets_latents.py.
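
For reference, a sketch of re-running the bucketing and latent-caching step with 1024-pixel buckets. Only --max_resolution comes from this thread; the positional arguments, metadata filenames, and model reference follow the usual finetune/prepare_buckets_latents.py invocation and are placeholders:

# recompute buckets and cached latents at SDXL resolution (illustrative paths)
python finetune/prepare_buckets_latents.py \
  /content/LoRA/train_data \
  /content/LoRA/meta_clean.json \
  /content/LoRA/meta_lat.json \
  gsdf/CounterfeitXL \
  --batch_size=4 \
  --max_resolution=1024,1024 \
  --mixed_precision=no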