derrian-distro / LoRA_Easy_Training_Scripts

A UI made in PySide6 to make training LoRA/LoCon and other LoRA-type models in sd-scripts easy
GNU General Public License v3.0

Can't get prodigy to train #229

Closed · MOGRAINEREPORTS closed 3 weeks ago

MOGRAINEREPORTS commented 3 weeks ago

Here are the settings. I get no error; it's just not training. 1 epoch or 20 epochs, generation isn't affected one bit by the LoRA. If I send the config to regular kohya it does train, so is it my error or is there something going on?

[[subsets]]
caption_extension = ".txt"
image_dir = "K:/SDXL/LORA/img/quxnn"
keep_tokens = 1
name = "quxnn"
num_repeats = 15
shuffle_caption = true

[train_mode]
train_mode = "lora"

[general_args.args]
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
sdxl = true
full_fp16 = true
mixed_precision = "fp16"
gradient_checkpointing = true
seed = 69420
max_token_length = 225
prior_loss_weight = 1.0
xformers = true
max_train_epochs = 10
cache_latents = true
training_comment = "PROD_V1"

[general_args.dataset_args]
# source images are all 768x768
resolution = 768 
batch_size = 1

[network_args.args]
network_alpha = 64.0
min_timestep = 0
max_timestep = 1000
network_dim = 64
fa = true

[optimizer_args.args]
loss_type = "l2"
max_grad_norm = 1.0
optimizer_type = "Prodigy"
learning_rate = 1.0
unet_lr = 1.0
text_encoder_lr = 1.0
lr_scheduler = "cosine"
warmup_ratio = 0.02
min_snr_gamma = 5

[saving_args.args]
save_precision = "fp16"
save_model_as = "safetensors"
save_every_n_epochs = 1

[noise_args.args]
multires_noise_iterations = 6
multires_noise_discount = 0.3

[extra_args.args]
decouple = "True"
weight_decay = "0.01"
d_coef = "0.8"
use_bias_correction = "True"
safeguard_warmup = "True"
betas = "0.9,0.99"

[bucket_args.dataset_args]
enable_bucket = true
min_bucket_reso = 256
max_bucket_reso = 1024
bucket_reso_steps = 64

[network_args.args.network_args]

[optimizer_args.args.optimizer_args]
weight_decay = "0.1"
betas = "0.9,0.99"
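
Side note on the config above: the Prodigy hyperparameters appear twice, once under [extra_args.args] (weight_decay = "0.01") and again under [optimizer_args.args.optimizer_args] (weight_decay = "0.1"), with conflicting weight_decay values. Which table the trainer actually forwards to Prodigy is not confirmed in this thread, so a safer sketch keeps a single source of truth:

[optimizer_args.args.optimizer_args]
# All Prodigy arguments in one table; values mirror the [extra_args.args] block above.
decouple = "True"
weight_decay = "0.01"
d_coef = "0.8"
use_bias_correction = "True"
safeguard_warmup = "True"
betas = "0.9,0.99"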

Jelosus2 commented 3 weeks ago

Do you train on SD 1.5?

MOGRAINEREPORTS commented 3 weeks ago

Do you train on SD 1.5?

sdxl = true

No, SDXL.

I've got AdamW8bit with full fp16 working and AdamW with full bf16 working, but I can't figure out the adaptive ones. There must be something I'm doing wrong, but this is the whole config, so it's in there if I am at fault here lol

Jelosus2 commented 3 weeks ago

Prodigy doesn't perform well with fp16. For SDXL I recommend Came as the optimizer and Rex (Rawr) as the scheduler. The LR depends; I train characters on PonyDiffusion, which is still SDXL, and I use 1e-4 for the unet LR and 1e-6 for the TE LR and the minimum LR. Also, just FYI: for SDXL you want to train at 1024 resolution, and if your GPU supports bf16, use full bf16.
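
In config terms, those suggestions boil down to the following changes relative to the first post. This is only a condensed sketch; the field names are taken from the complete config shared later in the thread:

[general_args.args]
full_bf16 = true       # bf16 instead of fp16, assuming the GPU supports it
mixed_precision = "bf16"

[general_args.dataset_args]
resolution = 1024      # SDXL native training resolution

[optimizer_args.args]
optimizer_type = "Came"
lr_scheduler_type = "LoraEasyCustomOptimizer.RexAnnealingWarmRestarts.RexAnnealingWarmRestarts"
unet_lr = 1e-4
text_encoder_lr = 1e-6 # also used as the scheduler's minimum LR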

MOGRAINEREPORTS commented 3 weeks ago

Prodigy doesn't perform well with fp16. For SDXL I recommend Came as the optimizer and Rex (Rawr) as the scheduler. The LR depends; I train characters on PonyDiffusion, which is still SDXL, and I use 1e-4 for the unet LR and 1e-6 for the TE LR and the minimum LR. Also, just FYI: for SDXL you want to train at 1024 resolution, and if your GPU supports bf16, use full bf16.

I guess I'm behind in the optimizer/scheduler game; to me, Prodigy is still the fresh new thing off the grill that's poppin'. I guess not.

Never heard of either of your suggestions, but I will definitely try them... I'd love it if you could share your whole config. I'll try something out myself in the meantime, thanks!

Jelosus2 commented 3 weeks ago

Depends on what you want to train, but sure, I'll share it in a few hours.

MOGRAINEREPORTS commented 3 weeks ago

Depends on what you want to train, but sure, I'll share it in a few hours.

Realistic likeness mostly, let me know! :D

Jelosus2 commented 3 weeks ago

This is what I use to train characters:

[[subsets]]
caption_extension = ".txt"
image_dir = "E:/Training_Loras/Leora/dataseter"
keep_tokens = 1
name = "cherry"
num_repeats = 1
shuffle_caption = true

[train_mode]
train_mode = "lora"

[general_args.args]
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
pretrained_model_name_or_path = "C:/Users/User/Documents/SwarmUI/Models/Stable-Diffusion/ponyDiffusionV6XL_v6StartWithThisOne.safetensors"
vae = "C:/Users/User/Documents/SwarmUI/Models/VAE/sdxl_vae.safetensors"
sdxl = true
no_half_vae = true
full_bf16 = true
mixed_precision = "bf16"
gradient_checkpointing = true
seed = 69
max_token_length = 225
prior_loss_weight = 1.0
sdpa = true
max_train_epochs = 10
cache_latents = true

[general_args.dataset_args]
resolution = 1024
batch_size = 4

[network_args.args]
network_dim = 8
network_alpha = 4.0
min_timestep = 0
max_timestep = 1000

[optimizer_args.args]
lr_scheduler = "cosine"
optimizer_type = "Came"
lr_scheduler_type = "LoraEasyCustomOptimizer.RexAnnealingWarmRestarts.RexAnnealingWarmRestarts"
lr_scheduler_num_cycles = 1
loss_type = "l2"
learning_rate = 0.0001
warmup_ratio = 0.05
unet_lr = 0.0001
text_encoder_lr = 1e-6
max_grad_norm = 1.0
min_snr_gamma = 5

[saving_args.args]
output_dir = "E:/Training_Loras/Leora/model"
output_name = "Leora-JeloXL"
save_precision = "fp16"
save_model_as = "safetensors"
save_every_n_epochs = 1
save_toml = true
save_toml_location = "E:/Training_Loras/Leora/model"

[logging_args.args]
logging_dir = "E:/Training_Loras/Leora/tensorboard_logging"
log_prefix = "leora-"
log_with = "tensorboard"

[bucket_args.dataset_args]
enable_bucket = true
min_bucket_reso = 256
max_bucket_reso = 4096
bucket_reso_steps = 64

[network_args.args.network_args]

[optimizer_args.args.lr_scheduler_args]
min_lr = 1e-6
gamma = 0.9

[optimizer_args.args.optimizer_args]
weight_decay = "0.04"
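
As a quick sanity check on warmup_ratio = 0.05 above, assuming the ratio is applied to the total optimizer step count (the actual dataset size isn't shown in the thread, so the image count below is hypothetical):

# Hypothetical dataset of 200 images, num_repeats = 1, batch_size = 4, 10 epochs:
# steps_per_epoch = 200 * 1 / 4 = 50
# total_steps     = 50 * 10     = 500
# warmup_steps    = 500 * 0.05  = 25
warmup_ratio = 0.05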

MOGRAINEREPORTS commented 3 weeks ago

This is what I use to train characters:

Amazing, thanks! It's working, but it's really rough for the model I'm using anyway; it seems to overfit quickly before training properly.

I'ma tune it, but it does work pretty well. PS: if I can ask you more questions about it on Discord or something, that'd be amazing. LMK!

As a note though, I'd really like to be using Prodigy or any good automated "set the LR to 1 and forget it" optimizer with this LoRA trainer, if anyone has a solution...

Can't get any of them to work.

Jelosus2 commented 3 weeks ago

Sure, you can add me on Discord. And Prodigy is not good with big datasets. I will close the issue now.
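
For readers arriving with the same problem: nothing in this thread confirms a working Prodigy setup, but combining the original post's Prodigy arguments with the bf16/1024 settings from the working Came config gives a plausible starting point. A sketch, untested in this thread:

[general_args.args]
sdxl = true
full_bf16 = true        # bf16 rather than full_fp16, per the advice above
mixed_precision = "bf16"

[general_args.dataset_args]
resolution = 1024       # SDXL native training resolution

[optimizer_args.args]
optimizer_type = "Prodigy"
learning_rate = 1.0     # Prodigy adapts the step size; the LR stays at 1
unet_lr = 1.0
text_encoder_lr = 1.0
lr_scheduler = "cosine"

[optimizer_args.args.optimizer_args]
decouple = "True"
weight_decay = "0.01"
d_coef = "0.8"          # as in the original post; Prodigy's default is 1.0
use_bias_correction = "True"
safeguard_warmup = "True"
betas = "0.9,0.99"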