Closed — ikimiya closed this issue 1 week ago
What base model are you training on?
It's Pony Diffusion V6 XL (the "start with this one" release), which is SDXL-based, with the SDXL VAE. I believe I had tried it with autismSDXL as well.
And what training network?
I don't know the network args off-hand (dim 8 / alpha 4), so I'll just give my configs. TestingConfig.toml:

```toml
[[subsets]]
caption_extension = ".txt"
image_dir = "D:/TestResumeTraining/test"
keep_tokens = 1
name = "test"
num_repeats = 2
shuffle_caption = true

[train_mode]
train_mode = "lora"

[general_args.args]
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
pretrained_model_name_or_path = "F:/stable-diffusion-webui/models/Stable-diffusion/PonyXL/ponyDiffusionV6XL_v6StartWithThisOne.safetensors"
vae = "F:/stable-diffusion-webui/models/VAE/sdxl_vae.safetensors"
sdxl = true
no_half_vae = true
full_fp16 = true
mixed_precision = "fp16"
gradient_checkpointing = true
gradient_accumulation_steps = 2
seed = 16
max_token_length = 225
prior_loss_weight = 1.0
xformers = true
cache_latents = true
max_train_epochs = 5

[general_args.dataset_args]
resolution = 1024
batch_size = 2

[network_args.args]
network_dim = 8
network_alpha = 4.0
min_timestep = 0
max_timestep = 1000

[optimizer_args.args]
optimizer_type = "Came"
lr_scheduler = "cosine"
loss_type = "huber"
huber_schedule = "snr"
huber_c = 0.1
learning_rate = 0.0001
unet_lr = 0.0001
text_encoder_lr = 1e-6
max_grad_norm = 1.0
min_snr_gamma = 8

[saving_args.args]
save_precision = "fp16"
save_model_as = "safetensors"
save_every_n_epochs = 1
save_state = true
output_dir = "D:/TestResumeTraining/NonResume"
output_name = "nonResumeTest"

[bucket_args.dataset_args]
enable_bucket = true
bucket_no_upscale = true
min_bucket_reso = 256
max_bucket_reso = 4096
bucket_reso_steps = 64

[network_args.args.network_args]

[optimizer_args.args.optimizer_args]
weight_decay = "0.1"
```
------ end

For the resumed configs I only changed the saving location, the output file name, and checked [x] resume save state, pointing it to the state folder location.
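For reference, the only part of a resumed config that should differ looks something like this; note that the `resume` key name here is an assumption modeled on kohya-ss sd-scripts' `--resume` option, and the UI's [x] resume checkbox may write it under a different section:

```toml
[saving_args.args]
save_precision = "fp16"
save_model_as = "safetensors"
save_every_n_epochs = 1
save_state = true
output_dir = "D:/TestResumeTraining/Resume"
output_name = "resumeTest"
# assumed key: points at the state folder saved by the earlier run
resume = "D:/TestResumeTraining/NonResume/nonResumeTest-state"
```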
Weird, I don't see anything wrong, and it works for Pony for me.
Yeah, I don't know what's wrong with it either, or whether it's a problem only on my end.
I think I see the issue: the VAE you are using is the one without the fp16 fix, and you are training in full fp16, so it's completely broken. Try grabbing the VAE from here and use that to train with.
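For anyone curious why the un-fixed VAE plus full fp16 breaks: fp16 tops out at 65504, and the original SDXL VAE is known to produce activations beyond that, which overflow to inf and then poison everything downstream as NaN. A minimal sketch using only Python's standard library (the `struct` module supports the IEEE-754 half format via `"e"`):

```python
import struct

# Largest finite fp16 value: bit pattern 0x7bff.
FP16_MAX = struct.unpack("<e", b"\xff\x7b")[0]
print(FP16_MAX)  # 65504.0

# Anything bigger simply does not fit in half precision; on GPU hardware
# it overflows to inf, and inf arithmetic quickly turns losses into NaN.
try:
    struct.pack("<e", 70000.0)
except OverflowError:
    print("70000.0 is out of fp16 range")
```

The fp16-fix VAE works by rescaling internals so activations stay under that ceiling.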
Okay, I'll try it and let you know if it fixes it.
I remember that I have tried training without Full FP16 or Full BF16, just using training precision bf16.
Idk why, but I can only run Full FP16 without crashing on larger datasets; training in fp16 or bf16 mixed precision on larger datasets crashes either during the first epoch or just after saving it (1/15). It seems like Full FP16 is broken as you said, but without Full FP16 it doesn't seem like I can train on more than about 50 images.
I have a 3080 with 10GB VRAM, if that helps.
Why don't you train in Full BF16 then?
I'd never really tried using Full BF16, but it seems like swapping to Full BF16 worked when I resumed training from the last epoch of the [65 dataset, Full FP16, regular VAE] run.
It's not broken; that's just the way it behaves. I found that some optimizers break when using fp16. BF16 is better anyway, so unless you have an old card that doesn't support bf16, there's no reason to use fp16.
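The range difference is easy to see without any ML libraries: bf16 keeps float32's 8-bit exponent (so the same dynamic range as fp32) and just drops mantissa bits, while fp16 trades exponent bits for precision and caps out at 65504. A quick sketch, again stdlib-only:

```python
import struct

def to_bf16(x: float) -> float:
    # bfloat16 is float32 with the low 16 mantissa bits zeroed out:
    # same 8-bit exponent (same range as fp32), only 7 mantissa bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# 70000.0 overflows fp16 (max finite value 65504) but stays finite in
# bf16 -- just rounded coarsely down to 69632.0.
print(to_bf16(70000.0))  # 69632.0
```

That coarser precision rarely matters for gradients, which is why bf16 is the usual recommendation on cards that support it (RTX 30xx and newer do).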
Ah, I didn't know; I just found some settings once and have always used them.
First Training Test
Some base training parameters: CAME optimizer, cosine LR scheduler, 0.0001 for all LRs, Min SNR Gamma, Full FP16.
Following this post, I save the state.
My settings
Since I trained 5 epochs, I have five states.
I then tested these LoRAs by creating an image, generating one with the 00001 epoch and one with the final epoch of training.
Testing resume (the problem)
Training from state 00001, the first epoch
Image has loaded in the command line
After running, I got the LoRA files
Testing the Resume Training Lora
I have tested resuming from both nonResumeTest-000001-state and nonResumeTest-state.
I have no idea what's going on, but I would like to know how to use the resume state properly so I can stop training whenever I want and resume whenever. Currently, when I try to use it, any LoRA created using the state comes out as a black LoRA and is unusable.
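As a side note, a "black" LoRA almost always means NaN weights in the saved file, and you can check for that without installing anything: the safetensors format is just an 8-byte little-endian header length, a JSON header, then raw tensor bytes. A minimal checker (a sketch that only handles `F16` tensors, which is what `save_precision = "fp16"` produces):

```python
import json
import struct

def find_nan_fp16_tensors(path):
    """Return names of F16 tensors in a .safetensors file that contain NaN."""
    with open(path, "rb") as f:
        # Layout: u64 little-endian header size, then that much JSON,
        # then raw tensor bytes addressed by each entry's data_offsets.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        bad = []
        for name, info in header.items():
            if name == "__metadata__" or info.get("dtype") != "F16":
                continue
            begin, end = info["data_offsets"]
            f.seek(data_start + begin)
            raw = f.read(end - begin)
            # NaN is the only float value that is not equal to itself.
            if any(v != v for (v,) in struct.iter_unpack("<e", raw)):
                bad.append(name)
        return bad
```

Running this on a LoRA saved from a resumed state versus a fresh one would show whether the resume itself is producing NaN weights, rather than something going wrong at generation time.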