derrian-distro / LoRA_Easy_Training_Scripts

A UI made in Pyside6 to make training LoRA/LoCon and other LoRA type models in sd-scripts easy
GNU General Public License v3.0

Lora trained with resume training (State) will always be black #244

Closed ikimiya closed 1 week ago

ikimiya commented 1 week ago

First Training Test

Some base training parameters: optimizer CAME, cosine LR scheduler, 0.0001 for all learning rates, min SNR gamma, full FP16.

Following this post, I saved the state:

Per https://github.com/derrian-distro/LoRA_Easy_Training_Scripts/issues/198#issuecomment-2045344768, you use these options to save the training state (screenshot), and when you want to resume, you input the folder where the state was saved (screenshot).
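For reference, here is a minimal sketch of roughly how those two options could end up in the generated config; the exact name and placement of the resume key is an assumption (in sd-scripts terms the options correspond to save_state and resume):

```toml
# Sketch only: save_state matches the config posted below; the resume key's name/placement is assumed.
[saving_args.args]
save_state = true   # save optimizer/scheduler state alongside each saved epoch
resume = "D:/TestResumeTraining/NonResume/nonResumeTest-000001-state"   # illustrative path to a saved state folder
```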

My settings

(screenshot of settings) Since I trained for 5 epochs, I have five saved states (screenshot).

I then tested these LoRAs by creating an image; both the 000001 epoch and the final epoch LoRA generate an image.

Testing resume (the problem)

The command line shows the state has loaded (screenshot).

After running, I got the LoRA files (screenshot).

Testing the resumed-training LoRA

(screenshot)

I have tested resuming from both nonResumeTest-000001-state and nonResumeTest-state.

I have no idea what's going on, but I would like to know how to use the resume state so I can stop training whenever I want and resume later. Currently, when I try to use it, any LoRA created using the saved state comes out black and unusable.

Jelosus2 commented 1 week ago

What base model are you training on?

ikimiya commented 1 week ago

> What base model are you training on?

It is Pony Diffusion V6 XL ("start with this one"), which is SDXL, with the SDXL VAE. I believe I had tried it with autismSDXL as well.

Jelosus2 commented 1 week ago

And what training network?

ikimiya commented 1 week ago

> And what training network?

Idk about the network args; it's 8/4, so I'll just give my configs.

TestingConfig.toml:

```toml
[[subsets]]
caption_extension = ".txt"
image_dir = "D:/TestResumeTraining/test"
keep_tokens = 1
name = "test"
num_repeats = 2
shuffle_caption = true

[train_mode]
train_mode = "lora"

[general_args.args]
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
pretrained_model_name_or_path = "F:/stable-diffusion-webui/models/Stable-diffusion/PonyXL/ponyDiffusionV6XL_v6StartWithThisOne.safetensors"
vae = "F:/stable-diffusion-webui/models/VAE/sdxl_vae.safetensors"
sdxl = true
no_half_vae = true
full_fp16 = true
mixed_precision = "fp16"
gradient_checkpointing = true
gradient_accumulation_steps = 2
seed = 16
max_token_length = 225
prior_loss_weight = 1.0
xformers = true
cache_latents = true
max_train_epochs = 5

[general_args.dataset_args]
resolution = 1024
batch_size = 2

[network_args.args]
network_dim = 8
network_alpha = 4.0
min_timestep = 0
max_timestep = 1000

[optimizer_args.args]
optimizer_type = "Came"
lr_scheduler = "cosine"
loss_type = "huber"
huber_schedule = "snr"
huber_c = 0.1
learning_rate = 0.0001
unet_lr = 0.0001
text_encoder_lr = 1e-6
max_grad_norm = 1.0
min_snr_gamma = 8

[saving_args.args]
save_precision = "fp16"
save_model_as = "safetensors"
save_every_n_epochs = 1
save_state = true
output_dir = "D:/TestResumeTraining/NonResume"
output_name = "nonResumeTest"

[bucket_args.dataset_args]
enable_bucket = true
bucket_no_upscale = true
min_bucket_reso = 256
max_bucket_reso = 4096
bucket_reso_steps = 64

[network_args.args.network_args]

[optimizer_args.args.optimizer_args]
weight_decay = "0.1"
```

For the resumed configs I only changed the saving location, the output file name, and checked [x] resume save state, pointing it at the state folder location.

Jelosus2 commented 1 week ago

Weird, I don't see anything wrong, and it works for Pony for me.

ikimiya commented 1 week ago

Yeah, I don't know what's wrong with it either, or whether it's a problem only on my end.

derrian-distro commented 1 week ago

I think I see the issue: the VAE you are using is the one without the fp16 fix, and you are training in full fp16, so it's completely broken. Try grabbing the VAE from here and use that to train with.
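In config terms, the suggested change amounts to pointing the vae key at an fp16-fix SDXL VAE instead of the stock one; a sketch against the config above, with a hypothetical local path:

```toml
[general_args.args]
# Hypothetical path: wherever the fp16-fix SDXL VAE is downloaded to.
# The stock sdxl_vae.safetensors overflows to NaN in half precision, which is what breaks full fp16 runs.
vae = "F:/stable-diffusion-webui/models/VAE/sdxl_vae_fp16_fix.safetensors"
full_fp16 = true
mixed_precision = "fp16"
```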

ikimiya commented 1 week ago

> I think I see the issue: the VAE you are using is the one without the fp16 fix, and you are training in full fp16, so it's completely broken.
>
> Try grabbing the VAE from here and use that to train with.

Okay, I'll try it and let you know if it fixes it.

ikimiya commented 1 week ago

> I think I see the issue: the VAE you are using is the one without the fp16 fix, and you are training in full fp16, so it's completely broken. Try grabbing the VAE from here and use that to train with.

I have downloaded the SDXL fp16-fix VAE and tested training with the same settings as before (full FP16, etc.) [5-image dataset].

Then I tested again without full FP16, using the SDXL fp16-fix VAE.

Then I tested with a previously trained LoRA (15 epochs) that was trained with full FP16 [8-image dataset].

Then I tested a LoRA I trained a couple of months ago [221-image dataset], trained with full FP16 and the regular VAE.

I remember that I have tried training with neither full FP16 nor full BF16, just the training precision set to bf16.

Idk why I can only run full FP16 without crashing on larger datasets; training with plain fp16 or bf16 mixed precision on larger datasets, I crash either during the first epoch or just after saving the first epoch, like (1/15). It seems like full FP16 is broken as you said, but without full FP16 it doesn't seem like I can train on datasets over about 50 images.

I have a 3080 with 10 GB of VRAM, if that helps.

Jelosus2 commented 1 week ago

Why don't you train with full BF16 then?

ikimiya commented 1 week ago

> Why don't you train with full BF16 then?

I'd never really tried full BF16, but it seems like swapping to full BF16 worked when I resumed training from the last epoch of the [65-image dataset, full FP16, regular VAE] run.
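As a sketch, that precision swap in config terms, assuming the UI writes a full_bf16 key in the same place it writes full_fp16 (both are sd-scripts options, and RTX 30-series cards like the 3080 support bf16):

```toml
[general_args.args]
full_bf16 = true            # instead of full_fp16 = true
mixed_precision = "bf16"    # instead of "fp16"
```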

Jelosus2 commented 1 week ago

It's not broken, that's just the way it behaves. I found that some optimizers break when using fp16. BF16 is better anyway, so unless you have an old card that doesn't support bf16, there's no reason to use fp16.

ikimiya commented 1 week ago

> It's not broken, that's just the way it behaves. I found that some optimizers break when using fp16. BF16 is better anyway, so unless you have an old card that doesn't support bf16, there's no reason to use fp16.

Ah, I didn't know. I just found a set of settings once and had always used it.