derrian-distro / LoRA_Easy_Training_Scripts

A UI made in Pyside6 to make training LoRA/LoCon and other LoRA type models in sd-scripts easy
GNU General Public License v3.0

Lora trained with resume training (State) will always be black #244

Closed ikimiya closed 1 week ago

ikimiya commented 1 week ago

First Training Test

Some base training parameters: optimizer CAME, cosine LR scheduler, 0.0001 for all learning rates, min SNR gamma, full FP16.

Following this post, I saved the state:

Per https://github.com/derrian-distro/LoRA_Easy_Training_Scripts/issues/198#issuecomment-2045344768, you use these options to save the training state (screenshot), and when you want to resume, you input the folder where the state was saved (screenshot).
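For reference, here is a minimal sketch of roughly how those two options could end up in the generated config; the exact name and placement of the resume key is an assumption (in sd-scripts terms the options correspond to save_state and resume):

```toml
# Sketch only: save_state matches the config posted below; the resume key's name/placement is assumed.
[saving_args.args]
save_state = true   # save optimizer/scheduler state alongside each saved epoch
resume = "D:/TestResumeTraining/NonResume/nonResumeTest-000001-state"   # illustrative path to a saved state folder
```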

My settings

(screenshot of settings) Since I trained for 5 epochs, I have five saved states (screenshot).

I then tested these LoRAs by creating an image; both the 000001 epoch and the final epoch LoRA generate an image.

Testing resume (the problem)

The command line shows the state has loaded (screenshot).

After running, I got the LoRA files (screenshot).

Testing the resumed-training LoRA

(screenshot)

I have tested resuming from both nonResumeTest-000001-state and nonResumeTest-state.

I have no idea what's going on, but I would like to know how to use the resume state so I can stop training whenever I want and resume later. Currently, when I try to use it, any LoRA created using the saved state comes out black and unusable.

Jelosus2 commented 1 week ago

What base model are you training on?

ikimiya commented 1 week ago

> What base model are you training on?

It is Pony Diffusion V6 XL ("start with this one"), which is SDXL, with the SDXL VAE. I believe I had tried it with autismSDXL as well.

Jelosus2 commented 1 week ago

And what training network?

ikimiya commented 1 week ago

> And what training network?

Idk about the network args; it's 8/4, so I'll just give my configs.

TestingConfig.toml:

```toml
[[subsets]]
caption_extension = ".txt"
image_dir = "D:/TestResumeTraining/test"
keep_tokens = 1
name = "test"
num_repeats = 2
shuffle_caption = true

[train_mode]
train_mode = "lora"

[general_args.args]
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
pretrained_model_name_or_path = "F:/stable-diffusion-webui/models/Stable-diffusion/PonyXL/ponyDiffusionV6XL_v6StartWithThisOne.safetensors"
vae = "F:/stable-diffusion-webui/models/VAE/sdxl_vae.safetensors"
sdxl = true
no_half_vae = true
full_fp16 = true
mixed_precision = "fp16"
gradient_checkpointing = true
gradient_accumulation_steps = 2
seed = 16
max_token_length = 225
prior_loss_weight = 1.0
xformers = true
cache_latents = true
max_train_epochs = 5

[general_args.dataset_args]
resolution = 1024
batch_size = 2

[network_args.args]
network_dim = 8
network_alpha = 4.0
min_timestep = 0
max_timestep = 1000

[optimizer_args.args]
optimizer_type = "Came"
lr_scheduler = "cosine"
loss_type = "huber"
huber_schedule = "snr"
huber_c = 0.1
learning_rate = 0.0001
unet_lr = 0.0001
text_encoder_lr = 1e-6
max_grad_norm = 1.0
min_snr_gamma = 8

[saving_args.args]
save_precision = "fp16"
save_model_as = "safetensors"
save_every_n_epochs = 1
save_state = true
output_dir = "D:/TestResumeTraining/NonResume"
output_name = "nonResumeTest"

[bucket_args.dataset_args]
enable_bucket = true
bucket_no_upscale = true
min_bucket_reso = 256
max_bucket_reso = 4096
bucket_reso_steps = 64

[network_args.args.network_args]

[optimizer_args.args.optimizer_args]
weight_decay = "0.1"
```

For the resumed configs I only changed the saving location, the output file name, and checked [x] resume save state, pointing it at the state folder location.

Jelosus2 commented 1 week ago

Weird, I don't see anything wrong, and it works for Pony for me.

ikimiya commented 1 week ago

Yeah, I don't know what's wrong with it either, or whether it's a problem only on my end.

derrian-distro commented 1 week ago

I think I see the issue: the VAE you are using is the one without the fp16 fix, and you are training in full fp16, so it's completely broken. Try grabbing the VAE from here and use that to train with.
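In config terms, the suggested change amounts to pointing the vae key at an fp16-fix SDXL VAE instead of the stock one; a sketch against the config above, with a hypothetical local path:

```toml
[general_args.args]
# Hypothetical path: wherever the fp16-fix SDXL VAE is downloaded to.
# The stock sdxl_vae.safetensors overflows to NaN in half precision, which is what breaks full fp16 runs.
vae = "F:/stable-diffusion-webui/models/VAE/sdxl_vae_fp16_fix.safetensors"
full_fp16 = true
mixed_precision = "fp16"
```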

ikimiya commented 1 week ago

> I think I see the issue: the VAE you are using is the one without the fp16 fix, and you are training in full fp16, so it's completely broken.
>
> Try grabbing the VAE from here and use that to train with.

Okay, I'll try it and let you know if it fixes it.

ikimiya commented 1 week ago

> I think I see the issue: the VAE you are using is the one without the fp16 fix, and you are training in full fp16, so it's completely broken. Try grabbing the VAE from here and use that to train with.

I have downloaded the SDXL fp16-fix VAE and tested training with the same settings as before (full FP16, etc.) [5-image dataset].

Then I tested again without full FP16, using the SDXL fp16-fix VAE.

Then I tested with a previously trained LoRA (15 epochs) that was trained with full FP16 [8-image dataset].

Then I tested a LoRA I trained a couple of months ago [221-image dataset], trained with full FP16 and the regular VAE.

I remember that I have tried training with neither full FP16 nor full BF16, just the training precision set to bf16.

Idk why I can only run full FP16 without crashing on larger datasets; training with plain fp16 or bf16 mixed precision on larger datasets, I crash either during the first epoch or just after saving the first epoch, like (1/15). It seems like full FP16 is broken as you said, but without full FP16 it doesn't seem like I can train on datasets over about 50 images.

I have a 3080 with 10 GB of VRAM, if that helps.

Jelosus2 commented 1 week ago

Why don't you train with full BF16 then?

ikimiya commented 1 week ago

> Why don't you train with full BF16 then?

I'd never really tried full BF16, but it seems like swapping to full BF16 worked when I resumed training from the last epoch of the [65-image dataset, full FP16, regular VAE] run.
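As a sketch, that precision swap in config terms, assuming the UI writes a full_bf16 key in the same place it writes full_fp16 (both are sd-scripts options, and RTX 30-series cards like the 3080 support bf16):

```toml
[general_args.args]
full_bf16 = true            # instead of full_fp16 = true
mixed_precision = "bf16"    # instead of "fp16"
```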

Jelosus2 commented 1 week ago

It's not broken, that's just the way it behaves. I found that some optimizers break when using fp16. BF16 is better anyway, so unless you have an old card that doesn't support bf16, there's no reason to use fp16.

ikimiya commented 1 week ago

> It's not broken, that's just the way it behaves. I found that some optimizers break when using fp16. BF16 is better anyway, so unless you have an old card that doesn't support bf16, there's no reason to use fp16.

Ah, I didn't know. I just found a set of settings once and had always used it.