bmaltais / kohya_ss

Finished checkpoint gives insanely bad results #2188

Open Deathawaits4 opened 5 months ago

Deathawaits4 commented 5 months ago

Hello, I have an issue: while training a checkpoint I get very, very good samples. They look exactly how they should. Yet the finished checkpoint, with the same sampler and the same settings, gives a completely garbled mess that doesn't even remotely resemble what I trained.

Why is this happening?

5KilosOfCheese commented 5 months ago

Can you provide the following things:

  1. The training settings you used (preferably the "print training command" output; just copy-paste it).
  2. A general overview of the samples, the generated images, and your dataset (this is important because otherwise we can't know how it went wrong), plus an example caption if you used captions.
  3. The settings you generated the images with.
  4. An example of the LoRA's behavior: generate one seed at strengths of 1, 0.75, 0.5, 0.25, and without the LoRA (see the sketch below).

Let's see if we can figure out what went wrong, because based on your description alone I can't really tell you anything. To an untrained eye, undercooked (too little training) and overcooked (overfit from too much training) look largely the same.
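For point 4, a minimal sketch of such a strength sweep using the diffusers library (this is only an illustration, not something from this thread; the base model, LoRA filename, prompt, and seed are placeholders):

```python
# Strength sweep over one fixed seed; paths, prompt and seed are placeholders.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # or whatever base model you trained on
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("my_lora.safetensors")  # hypothetical path to the trained LoRA

prompt = "your trigger word, plus a typical caption"
for scale in [1.0, 0.75, 0.5, 0.25, 0.0]:  # 0.0 is effectively "without LoRA"
    image = pipe(
        prompt,
        width=1024,
        height=1024,
        generator=torch.Generator("cuda").manual_seed(1234),  # same seed every run
        cross_attention_kwargs={"scale": scale},  # LoRA strength
    ).images[0]
    image.save(f"lora_strength_{scale}.png")
```

Comparing the five images side by side makes it much easier to see whether the LoRA is undercooked, overcooked, or simply not being applied.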

311-code commented 5 months ago

Not sure if this is the problem, but you can't really count on the final checkpoint; it could be overtrained. It saves as it goes, every however many epochs you set it to save, so try setting it to save a checkpoint around where you are getting the good samples.
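In the sd-scripts command line this is controlled by the save-frequency flag, e.g.:

```
--save_every_n_epochs=1
```

Then you can pick whichever intermediate epoch matched your best samples instead of the final checkpoint.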

If that's not the problem, then make sure you are actually sampling at 1024x1024 (--w 1024 --h 1024 in the sampling box) if you're doing SDXL.
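For reference, a per-line entry in the sample prompts file looks roughly like this (the prompt, negative prompt, and seed below are placeholders):

```
masterpiece, your trigger word --n low quality, worst quality --w 1024 --h 1024 --d 1234 --s 28 --l 7
```

where --w/--h set the resolution, --n the negative prompt, --d the seed, --s the steps, and --l the CFG scale.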

Vehnum commented 5 months ago

I am having the same issue; it sometimes outputs good images, but mostly the images are completely garbled.

Here are the training settings I used: kohya_ss/sd-scripts/sdxl_train_network.py" --bucket_no_upscale --bucket_reso_steps=32 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --gradient_checkpointing --keep_tokens="1" --learning_rate="1.0" --logging_dir="C:\Ai\me\processed\New folder\log1" --lr_scheduler="constant" --lr_scheduler_num_cycles="1" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="1024,1024" --max_token_length=150 --max_train_steps="1250" --min_snr_gamma=5 --mixed_precision="bf16" --network_alpha="128" --network_dim=128 --network_module=networks.lora --no_half_vae --optimizer_args weight_decay=0.4 decouple=True d0=0.00000033 use_bias_correction=True safeguard_warmup=True --optimizer_type="Prodigy" --output_dir="Z:\output-model-kohya" --output_name="xxx-XL-v0.9992-animagine" --pretrained_model_name_or_path="J:/ai/stable-diffusion-webui/models/Stable-diffusion/animagine-xl-3.1.safetensors" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="bf16" --scale_weight_norms="4" --seed="9752758" --text_encoder_lr=1.0 --train_batch_size="3" --training_comment="trigger: the queen of heart 1a" --train_data_dir="C:\Ai\me\processed\New folder\images_noidentifier" --unet_lr=1.0 --xformers --sample_sampler=euler_a --sample_prompts="Z:\output-model-kohya\sample\prompt.txt" --sample_every_n_epochs=1

5KilosOfCheese commented 5 months ago

Try changing keep_tokens to 2. And what extra optimizer arguments are you giving Prodigy? Prodigy doesn't work straight out of the box. Also try training in FP16, since BF16 can sometimes just produce garbage. Another thing to keep in mind is that not all models are created equal for training: models with a lot of merging and finetuning don't train as well. So if you can't get that specific one to work, try going back to plain old SDXL and see if the issue persists.

Also, drop your network dimension and alpha WAY down to begin with; I always start my training at 4 and double it as needed. At the current values it is almost impossible to tell whether something is wrong with the training process, the dataset, or both, because the LoRA can take in so much information.
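Concretely, relative to the command above, that would mean changing roughly these flags and leaving the rest alone (dim/alpha of 4 is only the suggested starting point, not a final value):

```
--keep_tokens="2" --mixed_precision="fp16" --network_dim=4 --network_alpha=4
```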

If you want further help, this list I made still applies:

  1. The training settings you used (preferably the "print training command" output; just copy-paste it).
  2. A general overview of the samples, the generated images, and your dataset (this is important because otherwise we can't know how it went wrong), plus an example caption if you used captions.
  3. The settings you generated the images with.
  4. An example of the LoRA's behavior: generate one seed at strengths of 1, 0.75, 0.5, 0.25, and without the LoRA.

To get an idea of the issue, we need to see the issue. LoRA settings need to be altered based on what you are training. I do 4-5 versions before I get an idea of what I want and need to do, and then usually the second of the final attempts gets me what I want. Generally, if I can't make it work in 5-6 attempts, I scrap it and start from the beginning by figuring out what is wrong with the dataset. But there are ALWAYS signs in the outputs that give an idea of what went wrong.