derrian-distro / LoRA_Easy_Training_Scripts

A UI made in Pyside6 to make training LoRA/LoCon and other LoRA type models in sd-scripts easy
GNU General Public License v3.0
1.06k stars · 103 forks

Cannot get a successful training on Flux. It's totally me. What am I doing wrong? #243

Closed · Melbee83 closed this 2 weeks ago

Melbee83 commented 2 weeks ago

OK, so please forgive me if this is the wrong place; I just don't know where else to post. Back in the realm of SD1.5, I found this program while wanting to make single-image LoRAs. Eventually I settled on some great settings, and it's been working really well ever since. This was back when everyone said it couldn't be done. Not only could it be done, it worked really, really well for what it was. Over time I've updated to use it on SD2, then XL, and now Flux. I downloaded the new Flux branch and keep it up to date.

I cannot seem to get a successful training, and since training time keeps growing, it no longer takes 3 minutes to run 150 epochs. It got longer when I increased to 1024x1024 for XL, and now it takes 35 or so minutes for Flux. Totally fine, except I'm missing something, and I can't keep trial-and-erroring until I get it; it's taking too long. So I came here to see if you all can assist. I've tweaked the settings like 20 times, and I'm clearly missing something.

Now, obviously my needs are basic in the grand scheme of massive datasets. I don't caption beyond the one trigger word, which typically isn't even needed, and I figured that when I get a bigger dataset for a project, I can make minor edits and it will work great. I've had no problems up until now, but I can't get this thing to produce anything usable on Flux.1 dev fp8, which was the first version I found it would reliably train on. Sure, the precision is not good, but the technology is really new; I can do a better training later, on a better model.

It actually trains all 150 epochs and writes the file; the file just doesn't work. A very different scenario from my SDXL days. I tried to adapt everything I know to make this compatible with Flux; I'm just missing something. With everyone moving to cloud everything, it's not like there are instructions on these sites for the nitty-gritty.
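(Editor's note, not from the thread: when a trained file writes out fine but "just doesn't work" in the UI, one quick sanity check is to list the tensor keys inside the `.safetensors` file — a Flux LoRA trained with `train_blocks = "single"` will typically contain `single_blocks` keys that an SD/SDXL-era loader won't recognize. A minimal stdlib-only sketch; the file it builds is a hypothetical stand-in for a real training output:)

```python
import json
import struct

def tensor_keys(path):
    """List tensor names in a .safetensors file using only the stdlib.
    Format: 8-byte little-endian header length, then a JSON header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return sorted(k for k in header if k != "__metadata__")

# Build a tiny stand-in file so the check is demonstrable without a real
# training run (the key name is illustrative of a Flux single-blocks LoRA).
key = "lora_unet_single_blocks_0_linear1.lora_down.weight"
header = json.dumps(
    {key: {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}
).encode("utf-8")
with open("demo_lora.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(header)) + header + b"\x00" * 4)

print(tensor_keys("demo_lora.safetensors"))
# → ['lora_unet_single_blocks_0_linear1.lora_down.weight']
```

On a real output file, point `tensor_keys` at the trained LoRA in your `output_dir` and check whether the key prefixes match what your inference frontend expects.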

For reference, I load the toml, select the directory inside my training folder, and then give it an output name, output folder, and image directory — the same process I've used all along. I just can't seem to get a training, and I'm missing something stupidly silly, I'm sure, as I had very good results on every training until now. Does anything jump out at you all? Advice would be appreciated more than you know, as I've looked all over for settings others use, picked out the useful stuff, and plugged it in, only to get another unusable file.

thanks. :)

This is my toml:

```toml
[[subsets]]
caption_extension = ".txt"
image_dir = "F:/Training"
name = "Training"
num_repeats = 1

[train_mode]
train_mode = "lora"

[general_args.args]
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
vae = "F:/stable-diffusion-webui-forge/webui/models/VAE/ae.safetensors"
clip_skip = 2
full_bf16 = true
mixed_precision = "bf16"
gradient_checkpointing = true
seed = 23
max_token_length = 225
prior_loss_weight = 1.0
xformers = true
max_train_epochs = 150
cache_latents = true
pretrained_model_name_or_path = "F:/stable-diffusion-webui-forge/webui/models/Stable-diffusion/flux1DevFp8_v10.safetensors"

[general_args.dataset_args]
resolution = [ 1024, 1024,]
batch_size = 1

[network_args.args]
network_dim = 64
network_alpha = 32.0
min_timestep = 0
max_timestep = 1000
network_train_unet_only = true
cache_text_encoder_outputs = true

[optimizer_args.args]
optimizer_type = "AdamW"
lr_scheduler = "constant"
loss_type = "l2"
learning_rate = 0.0001
unet_lr = 0.0001
max_grad_norm = 1.0

[saving_args.args]
output_dir = "F:/Training/"
save_precision = "bf16"
save_model_as = "safetensors"

[flux_args.args]
ae = "F:/stable-diffusion-webui-forge/webui/models/VAE/ae.safetensors"
clip_l = "F:/stable-diffusion-webui-forge/webui/models/text_encoder/clip_l.safetensors"
t5xxl = "F:/stable-diffusion-webui-forge/webui/models/text_encoder/t5xxl_fp16.safetensors"
t5xxl_max_token_length = 512
split_mode = true
timestep_sampling = "sigma"
discrete_flow_shift = false
weighting_scheme = "none"
guidance_scale = 3.5
model_prediction_type = "sigma_scaled"

[bucket_args.dataset_args]
enable_bucket = true
min_bucket_reso = 256
max_bucket_reso = 1024
bucket_reso_steps = 64

[network_args.args.network_args]
train_blocks = "single"

[optimizer_args.args.optimizer_args]
weight_decay = "0.1"
betas = "0.9,0.99"
```

PheonixAi420 commented 2 weeks ago

What exactly are you trying to train? I have trained several Flux celebrity LoRAs since I found this program, and it works really well for me. I would suggest changing to AdamW8bit, and clip skip 1 unless you are training an anime character or style. Here is an example toml file that I know works; just change the directories, vae, etc. to match your paths and you should be good.

```toml
[[subsets]]
caption_extension = ".txt"
image_dir = "C:/StableDiffusion/training/img/10_Aurolka"
keep_tokens = 1
name = "subset 1"
num_repeats = 10
shuffle_caption = true

[train_mode]
train_mode = "lora"

[general_args.args]
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
pretrained_model_name_or_path = "C:/StableDiffusion/stable-diffusion-webui/extensions/stable-diffusion-webui-forge/models/Stable-diffusion/flux1-dev.safetensors"
vae = "C:/StableDiffusion/stable-diffusion-webui/extensions/stable-diffusion-webui-forge/models/text_encoder/t5xxl_fp16.safetensors"
clip_skip = 1
no_half_vae = true
highvram = true
full_bf16 = true
mixed_precision = "bf16"
fp8_base = true
gradient_checkpointing = true
gradient_accumulation_steps = 1
seed = 23
max_token_length = 225
prior_loss_weight = 1.0
sdpa = true
max_train_epochs = 4
cache_latents = true
cache_latents_to_disk = true

[general_args.dataset_args]
resolution = [ 1024, 1024,]
batch_size = 1

[network_args.args]
network_dim = 32
network_alpha = 16.0
min_timestep = 0
max_timestep = 1000

[optimizer_args.args]
optimizer_type = "AdamW8bit"
lr_scheduler = "constant"
loss_type = "l2"
learning_rate = 0.0003
unet_lr = 0.0003
text_encoder_lr = 0.0003
max_grad_norm = 1.0
min_snr_gamma = 5

[saving_args.args]
output_dir = "C:/StableDiffusion/training/Model"
save_precision = "bf16"
save_model_as = "safetensors"
save_every_n_epochs = 1
save_toml = true
save_toml_location = "C:/StableDiffusion/training/Log"
save_state = true
output_name = "Aurolka"

[noise_args.args]
noise_offset = 0.4

[sample_args.args]
sample_sampler = "euler_a"
sample_every_n_epochs = 1
sample_prompts = "C:/StableDiffusion/training/img/002-.txt"

[flux_args.args]
ae = "C:/StableDiffusion/stable-diffusion-webui/extensions/stable-diffusion-webui-forge/models/VAE/ae.safetensors"
clip_l = "C:/StableDiffusion/stable-diffusion-webui/extensions/stable-diffusion-webui-forge/models/text_encoder/clip_l.safetensors"
t5xxl = "C:/StableDiffusion/stable-diffusion-webui/extensions/stable-diffusion-webui-forge/models/text_encoder/t5xxl_fp16.safetensors"
apply_t5_attn_mask = true
t5xxl_max_token_length = 512
split_mode = true
timestep_sampling = "sigmoid"
sigmoid_scale = 1.0
discrete_flow_shift = false
weighting_scheme = "none"
guidance_scale = 1.0
model_prediction_type = "raw"

[bucket_args.dataset_args]
enable_bucket = true
bucket_no_upscale = true
min_bucket_reso = 256
max_bucket_reso = 1024
bucket_reso_steps = 64

[network_args.args.network_args]
train_blocks = "single"

[optimizer_args.args.optimizer_args]
weight_decay = "0.1"
```

Melbee83 commented 2 weeks ago

Hello, and thank you very much for answering with your post. I looked through the settings on screen, logically worked through what I wanted versus what you are doing, and concluded the only two settings I needed to change were AdamW -> AdamW8bit (based on the model I'm forced to use for the time being) and the timestep sampling from "sigma" and "sigma_scaled" to "sigmoid" and "raw", and it worked without a hitch.

I thank you very much for your time and for caring enough to post your toml for comparison. It was invaluable in seeing my glaring issue.
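(Editor's note: for anyone hitting the same wall, the two changes described above correspond to these toml lines, reconstructed from this thread; everything else in the original config stayed the same:)

```toml
[optimizer_args.args]
optimizer_type = "AdamW8bit"    # was "AdamW"

[flux_args.args]
timestep_sampling = "sigmoid"   # was "sigma"
model_prediction_type = "raw"   # was "sigma_scaled"
```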