bmaltais / kohya_ss


Is SD3 training available? #2616

Open · furry123fdsare opened this issue 4 months ago

furry123fdsare commented 4 months ago

Is SD3 training available?

protector131090 commented 4 months ago

Is SD3 training available?

The script is done, but it is still not available in kohya. :(

bmaltais commented 4 months ago

It is still in development at the moment. Once things stabilise and it gets merged into the dev branch, I will start work on the GUI to integrate it. First I will need to understand all the specific parameters for SD3 so I can create an SD3-specific GUI section for the model... so keep faith. One day it will become available... but I can't be faster than the sd-scripts releases.

bmaltais commented 4 months ago

I did a comparison between the new SD3 trainer and the SDXL one to see how much has been carried over and what has been added. Here is the table so far:

| Option | SD3 | SDXL | Description |
| --- | :-: | :-: | --- |
| adaptive_noise_scale | x | x | Add latent mean absolute value * this value to noise_offset |
| alpha_mask | x | x | Use alpha channel as mask for training |
| async_upload | x | x | Upload to huggingface asynchronously |
| block_lr |  | x | Learning rates for each block of U-Net, comma-separated, 23 values |
| bucket_no_upscale | x | x | Make bucket for each image without upscaling |
| bucket_reso_steps | x | x | Steps of resolution for buckets, divisible by 8 is recommended |
| cache_info | x | x | Cache meta information (caption and image size) for faster dataset loading |
| cache_latents | x | x | Cache latents to main memory to reduce VRAM usage |
| cache_latents_to_disk | x | x | Cache latents to disk to reduce VRAM usage |
| cache_text_encoder_outputs | x | x | Cache text encoder outputs |
| cache_text_encoder_outputs_to_disk | x | x | Cache text encoder outputs to disk |
| caption_dropout_every_n_epochs | x | x | Dropout all captions every N epochs |
| caption_dropout_rate | x | x | Rate of dropout for captions (0.0~1.0) |
| caption_extension | x | x | Extension of caption files |
| caption_extention | x | x | Extension of caption files (backward compatibility) |
| caption_prefix | x | x | Prefix for caption text |
| caption_separator | x | x | Separator for caption |
| caption_suffix | x | x | Suffix for caption text |
| caption_tag_dropout_rate | x | x | Rate of dropout for comma-separated tokens (0.0~1.0) |
| clip_g | x |  | CLIP-G model path |
| clip_l | x |  | CLIP-L model path |
| clip_skip | x | x | Use output of nth layer from back of text encoder (n>=1) |
| color_aug | x | x | Enable weak color augmentation |
| conditioning_data_dir | x | x | Conditioning data directory |
| config_file | x | x | Config file for detail settings |
| console_log_file | x | x | Log to a file instead of stderr |
| console_log_level | x | x | Set the logging level |
| console_log_simple | x | x | Simple log output |
| dataset_class | x | x | Dataset class for arbitrary dataset |
| dataset_config | x | x | Config file for detail settings |
| dataset_repeats | x | x | Repeat dataset when training with captions |
| ddp_gradient_as_bucket_view | x | x | Enable gradient_as_bucket_view for DDP |
| ddp_static_graph | x | x | Enable static_graph for DDP |
| ddp_timeout | x | x | DDP timeout (min, None for default of accelerate) |
| debiased_estimation_loss | x | x | Use debiased estimation loss |
| debug_dataset | x | x | Show images for debugging (do not train) |
| deepspeed | x | x | Enable deepspeed training |
| diffusers_xformers |  | x | Use xformers by diffusers |
| disable_mmap_load_safetensors | x | x | Disable mmap load for safetensors |
| dynamo_backend | x | x | Dynamo backend type |
| enable_bucket | x | x | Enable buckets for multi aspect ratio training |
| enable_wildcard | x | x | Enable wildcard for caption |
| face_crop_aug_range | x | x | Enable face-centered crop augmentation and its range |
| flip_aug | x | x | Enable horizontal flip augmentation |
| fp16_master_weights_and_gradients | x | x | Use fp16 master weights and gradients |
| fp8_base | x | x | Use fp8 for base model |
| full_bf16 | x | x | BF16 training including gradients |
| full_fp16 | x | x | FP16 training including gradients |
| fused_backward_pass | x | x | Combines backward pass and optimizer step to reduce VRAM usage |
| fused_optimizer_groups | x | x | Number of optimizers for fused backward pass and optimizer step |
| gradient_accumulation_steps | x | x | Number of updates steps to accumulate before performing a backward/update pass |
| gradient_checkpointing | x | x | Enable gradient checkpointing |
| highvram | x | x | Disable low VRAM optimization |
| huber_c | x | x | The huber loss parameter |
| huber_schedule | x | x | The scheduling method for Huber loss |
| huggingface_path_in_repo | x | x | Huggingface model path to upload files |
| huggingface_repo_id | x | x | Huggingface repo name to upload |
| huggingface_repo_type | x | x | Huggingface repo type to upload |
| huggingface_repo_visibility | x | x | Huggingface repository visibility |
| huggingface_token | x | x | Huggingface token |
| in_json | x | x | JSON metadata for dataset |
| ip_noise_gamma | x | x | Enable input perturbation noise |
| ip_noise_gamma_random_strength | x | x | Use random strength for input perturbation noise |
| keep_tokens | x | x | Keep heading N tokens when shuffling caption tokens |
| keep_tokens_separator | x | x | Custom separator to divide caption into fixed and flexible parts |
| learning_rate | x | x | Learning rate |
| learning_rate_te1 |  | x | Learning rate for text encoder 1 (ViT-L) |
| learning_rate_te2 |  | x | Learning rate for text encoder 2 (BiG-G) |
| log_config | x | x | Log training configuration |
| log_prefix | x | x | Add prefix for each log directory |
| log_tracker_config | x | x | Path to tracker config file to use for logging |
| log_tracker_name | x | x | Name of tracker to use for logging |
| log_with | x | x | What logging tool(s) to use |
| logging_dir | x | x | Enable logging and output TensorBoard log to this directory |
| logit_mean | x |  | Mean to use for logit normal weighting scheme |
| logit_std | x |  | Standard deviation to use for logit normal weighting scheme |
| loss_type | x | x | The type of loss function to use |
| lowram | x | x | Enable low RAM optimization |
| lr_scheduler | x | x | Scheduler to use for learning rate |
| lr_scheduler_args | x | x | Additional arguments for scheduler |
| lr_scheduler_num_cycles | x | x | Number of restarts for cosine scheduler with restarts |
| lr_scheduler_power | x | x | Polynomial power for polynomial scheduler |
| lr_scheduler_type | x | x | Custom scheduler module |
| lr_warmup_steps | x | x | Number of steps for the warmup in the lr scheduler |
| masked_loss | x | x | Apply mask for calculating loss |
| max_bucket_reso | x | x | Maximum resolution for buckets |
| max_data_loader_n_workers | x | x | Max num workers for DataLoader |
| max_grad_norm | x | x | Max gradient norm, 0 for no clipping |
| max_timestep | x | x | Set maximum time step for U-Net training |
| max_token_length | x | x | Max token length of text encoder |
| max_train_epochs | x | x | Training epochs (overrides max_train_steps) |
| max_train_steps | x | x | Training steps |
| mem_eff_attn | x | x | Use memory efficient attention for CrossAttention |
| metadata_author | x | x | Author name for model metadata |
| metadata_description | x | x | Description for model metadata |
| metadata_license | x | x | License for model metadata |
| metadata_tags | x | x | Tags for model metadata, separated by comma |
| metadata_title | x | x | Title for model metadata |
| min_bucket_reso | x | x | Minimum resolution for buckets |
| min_snr_gamma | x | x | Gamma for reducing the weight of high loss timesteps |
| min_timestep | x | x | Set minimum time step for U-Net training |
| mixed_precision | x | x | Use mixed precision |
| mode_scale | x |  | Scale of mode weighting scheme |
| multires_noise_discount | x | x | Set discount value for multires noise |
| multires_noise_iterations | x | x | Enable multires noise with this number of iterations |
| no_half_vae |  | x | Do not use fp16/bf16 VAE in mixed precision |
| noise_offset | x | x | Enable noise offset with this value |
| noise_offset_random_strength | x | x | Use random strength for noise offset |
| offload_optimizer_device | x | x | Device for offloading optimizer |
| offload_optimizer_nvme_path | x | x | NVMe path for offloading optimizer |
| offload_param_device | x | x | Device for offloading parameters |
| offload_param_nvme_path | x | x | NVMe path for offloading parameters |
| optimizer_args | x | x | Additional arguments for optimizer |
| optimizer_type | x | x | Optimizer to use |
| output_config | x | x | Output command line args to given .toml file |
| output_dir | x | x | Directory to output trained model |
| output_name | x | x | Base name of trained model file |
| persistent_data_loader_workers | x | x | Persistent DataLoader workers |
| pretrained_model_name_or_path | x | x | Pretrained model to train, directory to Diffusers model or StableDiffusion checkpoint |
| random_crop | x | x | Enable random crop |
| reg_data_dir | x | x | Directory for regularization images |
| resolution | x | x | Resolution in training |
| resume | x | x | Saved state to resume training |
| resume_from_huggingface | x | x | Resume from huggingface |
| sample_at_first | x | x | Generate sample images before training |
| sample_every_n_epochs | x | x | Generate sample images every N epochs |
| sample_every_n_steps | x | x | Generate sample images every N steps |
| sample_prompts | x | x | File for prompts to generate sample images |
| sample_sampler | x | x | Sampler (scheduler) type for sample images |
| save_clip | x |  | Save CLIP models to checkpoint |
| save_every_n_epochs | x | x | Save checkpoint every N epochs |
| save_every_n_steps | x | x | Save checkpoint every N steps |
| save_last_n_epochs | x | x | Save last N checkpoints when saving every N epochs |
| save_last_n_epochs_state | x | x | Save last N checkpoints of state |
| save_last_n_steps | x | x | Save checkpoints until N steps elapsed |
| save_last_n_steps_state | x | x | Save states until N steps elapsed |
| save_model_as | x | x | Format to save the model |
| save_n_epoch_ratio | x | x | Save checkpoint N epoch ratio |
| save_precision | x | x | Precision in saving |
| save_state | x | x | Save training state additionally when saving model |
| save_state_on_train_end | x | x | Save training state on train end |
| save_state_to_huggingface | x | x | Save state to huggingface |
| save_t5xxl | x |  | Save T5-XXL model to checkpoint |
| scale_v_pred_loss_like_noise_pred | x | x | Scale v-prediction loss like noise prediction loss |
| sdpa | x | x | Use sdpa for CrossAttention |
| secondary_separator | x | x | Secondary separator for caption |
| seed | x | x | Random seed for training |
| shuffle_caption | x | x | Shuffle separated caption |
| t5xxl | x |  | T5-XXL model path |
| t5xxl_device | x |  | T5-XXL device |
| t5xxl_dtype | x |  | T5-XXL dtype |
| text_encoder_batch_size | x |  | Text encoder batch size |
| token_warmup_min | x | x | Start learning at N tags |
| token_warmup_step | x | x | Tag length reaches maximum on N steps |
| tokenizer_cache_dir | x | x | Directory for caching Tokenizer |
| torch_compile | x | x | Use torch.compile |
| train_batch_size | x | x | Batch size for training |
| train_data_dir | x | x | Directory for train images |
| train_text_encoder |  | x | Train text encoder |
| use_8bit_adam | x | x | Use 8bit AdamW optimizer |
| use_lion_optimizer | x | x | Use Lion optimizer |
| use_safetensors | x | x | Use safetensors format to save |
| v2 | x | x | Load Stable Diffusion v2.0 model |
| v_parameterization | x | x | Enable v-parameterization training |
| v_pred_like_loss | x | x | Add v-prediction like loss multiplied by this value |
| vae | x | x | Path to checkpoint of VAE to replace |
| vae_batch_size | x | x | Batch size for caching latents |
| wandb_api_key | x | x | Specify WandB API key |
| wandb_run_name | x | x | The name of the specific wandb session |
| weighted_captions | x | x | Enable weighted captions |
| weighting_scheme | x |  | Options for weighting scheme |
| xformers | x | x | Use xformers for CrossAttention |
| zero3_init_flag | x | x | Flag to indicate whether to enable deepspeed.zero.Init |
| zero3_save_16bit_model | x | x | Flag to indicate whether to save 16-bit model |
| zero_stage | x | x | ZeRO stage for DeepSpeed |
| zero_terminal_snr | x | x | Fix noise scheduler betas to enforce zero terminal SNR |
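
For anyone wondering what the SD3-only rows like weighting_scheme, logit_mean, logit_std and mode_scale are about: they control how training timesteps are sampled and weighted for the flow-matching objective. Below is a rough, hypothetical sketch of what these parameters do, based on the SD3 paper / diffusers-style formulation, not the actual sd-scripts code:

```python
import torch

def sample_flow_timesteps(batch_size, weighting_scheme="logit_normal",
                          logit_mean=0.0, logit_std=1.0, mode_scale=1.29):
    # Sketch only: illustrates the role of the SD3-specific options above.
    # "logit_normal": t = sigmoid(n) with n ~ Normal(logit_mean, logit_std),
    # which concentrates training on mid-range noise levels.
    if weighting_scheme == "logit_normal":
        n = torch.normal(mean=logit_mean, std=logit_std, size=(batch_size,))
        return torch.sigmoid(n)
    # "mode": mode-weighted sampling controlled by mode_scale.
    if weighting_scheme == "mode":
        u = torch.rand(batch_size)
        return 1 - u - mode_scale * (torch.cos(torch.pi * u / 2) ** 2 - 1 + u)
    # Fallback: uniform timesteps in (0, 1).
    return torch.rand(batch_size)

timesteps = sample_flow_timesteps(4)  # four values in roughly (0, 1)
```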
bmaltais commented 4 months ago

Some info about the difference so far between sd3_train and sdxl_train:

Yes, sd3_train.py and sdxl_train.py are similar in structure and purpose, as they are both training scripts for Stable Diffusion models. However, there are several key differences between them:

  1. Model Architecture:

    • sd3_train.py is designed for training SD3 (Stable Diffusion 3) models.
    • sdxl_train.py is for training SDXL (Stable Diffusion XL) models.
  2. Text Encoders:

    • SD3 uses CLIP-L, CLIP-G, and T5-XXL as text encoders.
    • SDXL uses two CLIP text encoders (referred to as text_encoder1 and text_encoder2).
  3. UNet/Diffusion Model:

    • SD3 uses an MMDiT (Multimodal Diffusion Transformer) model.
    • SDXL uses a modified UNet architecture.
  4. Training Process:

    • SD3 uses a flow matching approach for training.
    • SDXL uses a more traditional diffusion process.
  5. Scheduler:

    • SD3 uses a FlowMatchEulerDiscreteScheduler.
    • SDXL uses a DDPMScheduler.
  6. Conditioning:

    • SD3 has a different conditioning process, incorporating outputs from CLIP-L, CLIP-G, and optionally T5-XXL.
    • SDXL uses a combination of two CLIP encoder outputs and additional embeddings for size and crop information.
  7. Loss Calculation:

    • SD3 uses a flow matching loss.
    • SDXL uses a more standard diffusion loss.
  8. Argument Parsing:

    • There are differences in the command-line arguments accepted by each script, reflecting the different model architectures and training processes.
  9. Model Loading and Saving:

    • The scripts have different functions for loading and saving model checkpoints, reflecting the different model architectures.

While both scripts share a common overall structure (data loading, model setup, training loop, etc.), the specific implementation details differ significantly due to the architectural differences between SD3 and SDXL. The SD3 training process appears to be more complex, incorporating multiple text encoders and a flow matching approach, while the SDXL training process is more similar to previous Stable Diffusion versions.
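
To make the flow-matching vs. standard-diffusion distinction above more concrete, here is a minimal, schematic sketch of the two training objectives (simplified PyTorch, not the actual sd3_train.py / sdxl_train.py code):

```python
import torch
import torch.nn.functional as F

def sdxl_style_loss(model, latents, cond, alphas_cumprod):
    # DDPM-style epsilon prediction (SDXL): add scheduler noise and train
    # the network to predict the noise that was added.
    noise = torch.randn_like(latents)
    t = torch.randint(0, alphas_cumprod.numel(), (latents.shape[0],))
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    return F.mse_loss(model(noisy, t, cond), noise)

def sd3_style_loss(model, latents, cond):
    # Flow matching (SD3): mix data and noise linearly at a continuous
    # timestep t and train the network to predict the velocity pointing
    # from noise back to data.
    noise = torch.randn_like(latents)
    t = torch.sigmoid(torch.randn(latents.shape[0]))  # e.g. logit-normal timesteps
    tb = t.view(-1, 1, 1, 1)
    noisy = (1 - tb) * latents + tb * noise
    target = noise - latents                          # velocity target
    return F.mse_loss(model(noisy, t, cond), target)
```

This is also why SD3 samples with a FlowMatchEulerDiscreteScheduler at inference time while SDXL uses a DDPMScheduler during training.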

vsatyamesc commented 4 months ago

> Some info about the difference so far between sd3_train and sdxl_train: […]

Any updates?

bmaltais commented 4 months ago

Waiting for sd-scripts to complete the work on the sd3 branch and merge it into dev. Once it is merged into the sd-scripts dev branch and things are stable, I will work on the GUI integration.

vsatyamesc commented 4 months ago

It's actually pretty complete, though. Although I'm having a few problems running the code, people have been able to train with it. Well, take your time.

bmaltais commented 4 months ago

I have received good parameters to test training with SD3, so I will begin work on the GUI. The WIP will be available in the sd3 branch for those who are curious and might want to test it out.

bmaltais commented 4 months ago

OK, I think there is now an MVP (Minimum Viable Product) for SD3 training. Here is how to use it:

  1. git pull ; git checkout sd3
  2. Select the SD3 medium safetensor model as the Pretrained model name or path
  3. Select the SD3 checkbox
  4. Under Parameters, in the Basic accordion, you will find the SD3 accordion for SD3-specific parameters:

(screenshot: the SD3 accordion under Parameters > Basic)

That is it. Everything else is up to you to configure for the training. Let me know how it goes, and share back presets that I can add to the GUI so others can quickly start fine-tuning SD3 or creating a Dreambooth (essentially the same as fine-tuning).
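
For the curious, here is a rough, illustrative sketch (not the actual kohya_ss GUI code) of how the SD3-specific options from the comparison table above might be translated into an sd-scripts sd3_train.py command line. The flag names come from the table; the helper name, paths, and default values are made up for the example:

```python
# Hypothetical illustration of mapping GUI settings onto sd3_train.py flags
# taken from the comparison table above. Not the real kohya_ss code.
def build_sd3_train_args(pretrained_model, train_data_dir, output_dir,
                         clip_l=None, clip_g=None, t5xxl=None,
                         weighting_scheme="logit_normal"):
    args = [
        "accelerate", "launch", "sd3_train.py",
        "--pretrained_model_name_or_path", pretrained_model,
        "--train_data_dir", train_data_dir,
        "--output_dir", output_dir,
        "--weighting_scheme", weighting_scheme,
        "--save_model_as", "safetensors",
    ]
    # When the checkpoint does not bundle the text encoders, they can be
    # pointed to separately via the SD3-only options.
    for flag, path in (("--clip_l", clip_l), ("--clip_g", clip_g), ("--t5xxl", t5xxl)):
        if path:
            args += [flag, path]
    return args

print(" ".join(build_sd3_train_args("sd3_medium.safetensors", "./train", "./output")))
```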

BenDes21 commented 4 months ago

> OK, I think there is now an MVP (Minimum Viable Product) for SD3 training. Here is how to use it: […]

Nice!!! Thanks. For the SD3 model, is it sd3_medium_incl_clips_t5xxlfp8.safetensors?

vsatyamesc commented 4 months ago

> OK, I think there is now an MVP (Minimum Viable Product) for SD3 training. Here is how to use it: […]

Can we use the model with the CLIPs merged in, or the sd_medium (4 GB) model with the other CLIP models loaded separately?

WarAnakin commented 4 months ago

(screenshot: error traceback about the tokenizer)

WarAnakin commented 4 months ago

It's throwing a few errors. The first one has to do with caching the text encoder outputs, which I was able to get past, and then it throws the error above regarding the tokenizer.

bmaltais commented 4 months ago

It is quite possible kohya might have to fix a few things on his end. Today I will try to do an actual training run. Yesterday was just about building the GUI as a minimum product to start playing with and to discover potential issues.