cocktailpeanut / fluxgym

Dead simple FLUX LoRA training UI with LOW VRAM support

Weird Behavior in LORA Training #226

Open ircrp opened 3 weeks ago

ircrp commented 3 weeks ago

I'm encountering some strange behavior while training a LoRA model using FluxGym, and I'm curious if anyone else has seen something similar. During training I generated samples at intervals (steps 250, 500, and 750) to check the model's progression, and I've attached an image that illustrates this. Here are the sample prompts I used:

ADI4 as fireman --d 999

ADI4 as software engineer --d 999

ADI4 as president --d 999

ADI4 as teacher --d 999
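
(For reference, these four lines go verbatim into the sample_prompts.txt passed to --sample_prompts in the script below. As far as I understand kohya's sample-prompt syntax, --d pins the seed for each sample prompt; other per-line flags such as --w/--h for resolution and --s for sampling steps could be added the same way if wanted, e.g.:

ADI4 as fireman --d 999 --w 512 --h 512 --s 20

The extra flags here are just illustrative, not what I actually used.)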

At step 500, the outputs generally align with the specific professions in the prompts, showing elements unique to each role, like a suit for "president" or firefighter gear. By step 750, however, things get weird: all the generations start looking like each other, in particular taking on the traditional clothing from the teacher role. It almost feels as if the training is somehow being "overwritten" by previous samples.

Here’s a rough timeline of what I noticed:

Step 250: The generated samples are somewhat unique to each prompt but still rough.

Step 500: Outputs become clearer, and some begin aligning more closely with the trained character.

Step 750: Almost all samples look strikingly similar, with many reflecting the traditional attire seen in the "teacher" sample from step 500, even for prompts like "fireman" and "president," which shouldn't be the case.

[Attached image: generated samples for each prompt at steps 250, 500, and 750]

Questions:

Has anyone experienced this type of "style bleed" before? Could it be that prior generated samples are somehow influencing the current ones?

Is there a known issue in FluxGym where training appears to converge too strongly toward one class or style over iterations?

Any suggestions on preventing this kind of merging of styles as training progresses?

Train Script:

accelerate launch \
--mixed_precision bf16 \
--num_cpu_threads_per_process 1 \
sd-scripts/flux_train_network.py \
--pretrained_model_name_or_path "/home/me/fluxgym/models/unet/flux1-dev.sft" \
--clip_l "/home/me/fluxgym/models/clip/clip_l.safetensors" \
--t5xxl "/home/me/fluxgym/models/clip/t5xxl_fp16.safetensors" \
--ae "/home/me/fluxgym/models/vae/ae.sft" \
--cache_latents_to_disk \
--save_model_as safetensors \
--sdpa --persistent_data_loader_workers \
--max_data_loader_n_workers 2 \
--seed 42 \
--gradient_checkpointing \
--mixed_precision bf16 \
--save_precision bf16 \
--network_module networks.lora_flux \
--network_dim 4 \
--optimizer_type adamw8bit \
--sample_prompts="/home/me/fluxgym/outputs/adi4/sample_prompts.txt" \
--sample_every_n_steps="250" \
--learning_rate 8e-4 \
--cache_text_encoder_outputs \
--cache_text_encoder_outputs_to_disk \
--fp8_base \
--highvram \
--max_train_epochs 16 \
--save_every_n_epochs 4 \
--dataset_config "/home/me/fluxgym/outputs/adi4/dataset.toml" \
--output_dir "/home/me/fluxgym/outputs/adi4" \
--output_name adi4 \
--timestep_sampling shift \
--discrete_flow_shift 3.1582 \
--model_prediction_type raw \
--guidance_scale 1 \
--loss_type l2

Train Config:

[general]
shuffle_caption = false
caption_extension = '.txt'
keep_tokens = 1

[[datasets]]
resolution = 512
batch_size = 1
keep_tokens = 1
[[datasets.subsets]]
image_dir = '/home/me/fluxgym/datasets/adi4'
class_tokens = 'ADI4'
num_repeats = 10
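
For scale, here's a rough step count implied by this config (the image count is a hypothetical 20, just for illustration; sd-scripts walks each image num_repeats times per epoch):

IMAGES=20                                    # hypothetical dataset size, for illustration only
REPEATS=10                                   # num_repeats from dataset.toml
EPOCHS=16                                    # max_train_epochs from the launch command
STEPS_PER_EPOCH=$(( IMAGES * REPEATS ))      # batch_size = 1
TOTAL_STEPS=$(( STEPS_PER_EPOCH * EPOCHS ))
echo "steps/epoch: ${STEPS_PER_EPOCH}, total: ${TOTAL_STEPS}"
# -> steps/epoch: 200, total: 3200; step 500 would fall in epoch 3 and step 750 in epoch 4
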
endege commented 2 weeks ago

That is not weird, and it shouldn't have anything to do with fluxgym; fluxgym is just a wrapper around kohya's sd-scripts.

What you are experiencing is most likely overtraining. If you overtrain a LoRA, it will start to produce very similar images regardless of the prompt. Best case scenario: use the LoRA from around step 500. If you want better results, you can increase the number of training images and/or the number of repeats per image (num_repeats); usually I would go with more images.
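
If you want to actually keep the step-500 weights rather than only the sample images, one option (a sketch, assuming your sd-scripts version supports step-based checkpointing via --save_every_n_steps, which the kohya train_network scripts generally do) is to add the following to the launch command above:

--save_every_n_steps 250 \   # keep LoRA weights at the same cadence as the samples
--max_train_epochs 8 \       # or simply lower the epoch count to stop before the samples collapse

That way you can pick whichever intermediate checkpoint looks best instead of only getting the epoch-boundary saves.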