furry123fdsare opened this issue 4 months ago
Is SD3 training available?
The script is done but it is still not available in kohya :(
It is still in development at the moment. Once things stabilise and it gets merged into the dev branch, I will start work on the GUI to integrate it. First I will need to understand all the specific parameters for SD3 so I can create an SD3-specific GUI section for the model... so keep faith. One day it will become available... but I can't be faster than sd-scripts releases.
I did a comparison between the new SD3 trainer and the SDXL one to see how much has been carried over and what has been added. Here is the table so far:
Option | SD3 | SDXL | Description |
---|---|---|---|
adaptive_noise_scale | x | x | Add latent mean absolute value * this value to noise_offset |
alpha_mask | x | x | Use alpha channel as mask for training |
async_upload | x | x | Upload to huggingface asynchronously |
block_lr | | x | Learning rates for each block of U-Net, comma-separated, 23 values |
bucket_no_upscale | x | x | Make bucket for each image without upscaling |
bucket_reso_steps | x | x | Steps of resolution for buckets, divisible by 8 is recommended |
cache_info | x | x | Cache meta information (caption and image size) for faster dataset loading |
cache_latents | x | x | Cache latents to main memory to reduce VRAM usage |
cache_latents_to_disk | x | x | Cache latents to disk to reduce VRAM usage |
cache_text_encoder_outputs | x | x | Cache text encoder outputs |
cache_text_encoder_outputs_to_disk | x | x | Cache text encoder outputs to disk |
caption_dropout_every_n_epochs | x | x | Dropout all captions every N epochs |
caption_dropout_rate | x | x | Rate of dropout for captions (0.0~1.0) |
caption_extension | x | x | Extension of caption files |
caption_extention | x | x | Extension of caption files (backward compatibility) |
caption_prefix | x | x | Prefix for caption text |
caption_separator | x | x | Separator for caption |
caption_suffix | x | x | Suffix for caption text |
caption_tag_dropout_rate | x | x | Rate of dropout for comma-separated tokens (0.0~1.0) |
clip_g | x | | CLIP-G model path |
clip_l | x | | CLIP-L model path |
clip_skip | x | x | Use output of nth layer from back of text encoder (n>=1) |
color_aug | x | x | Enable weak color augmentation |
conditioning_data_dir | x | x | Conditioning data directory |
config_file | x | x | Config file for detail settings |
console_log_file | x | x | Log to a file instead of stderr |
console_log_level | x | x | Set the logging level |
console_log_simple | x | x | Simple log output |
dataset_class | x | x | Dataset class for arbitrary dataset |
dataset_config | x | x | Config file for detail settings |
dataset_repeats | x | x | Repeat dataset when training with captions |
ddp_gradient_as_bucket_view | x | x | Enable gradient_as_bucket_view for DDP |
ddp_static_graph | x | x | Enable static_graph for DDP |
ddp_timeout | x | x | DDP timeout (min, None for default of accelerate) |
debiased_estimation_loss | x | x | Use debiased estimation loss |
debug_dataset | x | x | Show images for debugging (do not train) |
deepspeed | x | x | Enable deepspeed training |
diffusers_xformers | | x | Use xformers by diffusers |
disable_mmap_load_safetensors | x | x | Disable mmap load for safetensors |
dynamo_backend | x | x | Dynamo backend type |
enable_bucket | x | x | Enable buckets for multi aspect ratio training |
enable_wildcard | x | x | Enable wildcard for caption |
face_crop_aug_range | x | x | Enable face-centered crop augmentation and its range |
flip_aug | x | x | Enable horizontal flip augmentation |
fp16_master_weights_and_gradients | x | x | Use fp16 master weights and gradients |
fp8_base | x | x | Use fp8 for base model |
full_bf16 | x | x | BF16 training including gradients |
full_fp16 | x | x | FP16 training including gradients |
fused_backward_pass | x | x | Combines backward pass and optimizer step to reduce VRAM usage |
fused_optimizer_groups | x | x | Number of optimizers for fused backward pass and optimizer step |
gradient_accumulation_steps | x | x | Number of update steps to accumulate before performing a backward/update pass |
gradient_checkpointing | x | x | Enable gradient checkpointing |
highvram | x | x | Disable low VRAM optimization |
huber_c | x | x | The huber loss parameter |
huber_schedule | x | x | The scheduling method for Huber loss |
huggingface_path_in_repo | x | x | Huggingface model path to upload files |
huggingface_repo_id | x | x | Huggingface repo name to upload |
huggingface_repo_type | x | x | Huggingface repo type to upload |
huggingface_repo_visibility | x | x | Huggingface repository visibility |
huggingface_token | x | x | Huggingface token |
in_json | x | x | JSON metadata for dataset |
ip_noise_gamma | x | x | Enable input perturbation noise |
ip_noise_gamma_random_strength | x | x | Use random strength for input perturbation noise |
keep_tokens | x | x | Keep the leading N tokens when shuffling caption tokens |
keep_tokens_separator | x | x | Custom separator to divide caption into fixed and flexible parts |
learning_rate | x | x | Learning rate |
learning_rate_te1 | | x | Learning rate for text encoder 1 (ViT-L) |
learning_rate_te2 | | x | Learning rate for text encoder 2 (BiG-G) |
log_config | x | x | Log training configuration |
log_prefix | x | x | Add prefix for each log directory |
log_tracker_config | x | x | Path to tracker config file to use for logging |
log_tracker_name | x | x | Name of tracker to use for logging |
log_with | x | x | What logging tool(s) to use |
logging_dir | x | x | Enable logging and output TensorBoard log to this directory |
logit_mean | x | | Mean to use for logit normal weighting scheme |
logit_std | x | | Standard deviation to use for logit normal weighting scheme |
loss_type | x | x | The type of loss function to use |
lowram | x | x | Enable low RAM optimization |
lr_scheduler | x | x | Scheduler to use for learning rate |
lr_scheduler_args | x | x | Additional arguments for scheduler |
lr_scheduler_num_cycles | x | x | Number of restarts for cosine scheduler with restarts |
lr_scheduler_power | x | x | Polynomial power for polynomial scheduler |
lr_scheduler_type | x | x | Custom scheduler module |
lr_warmup_steps | x | x | Number of steps for the warmup in the lr scheduler |
masked_loss | x | x | Apply mask for calculating loss |
max_bucket_reso | x | x | Maximum resolution for buckets |
max_data_loader_n_workers | x | x | Max num workers for DataLoader |
max_grad_norm | x | x | Max gradient norm, 0 for no clipping |
max_timestep | x | x | Set maximum time step for U-Net training |
max_token_length | x | x | Max token length of text encoder |
max_train_epochs | x | x | Training epochs (overrides max_train_steps) |
max_train_steps | x | x | Training steps |
mem_eff_attn | x | x | Use memory efficient attention for CrossAttention |
metadata_author | x | x | Author name for model metadata |
metadata_description | x | x | Description for model metadata |
metadata_license | x | x | License for model metadata |
metadata_tags | x | x | Tags for model metadata, separated by comma |
metadata_title | x | x | Title for model metadata |
min_bucket_reso | x | x | Minimum resolution for buckets |
min_snr_gamma | x | x | Gamma for reducing the weight of high loss timesteps |
min_timestep | x | x | Set minimum time step for U-Net training |
mixed_precision | x | x | Use mixed precision |
mode_scale | x | | Scale of mode weighting scheme |
multires_noise_discount | x | x | Set discount value for multires noise |
multires_noise_iterations | x | x | Enable multires noise with this number of iterations |
no_half_vae | | x | Do not use fp16/bf16 VAE in mixed precision |
noise_offset | x | x | Enable noise offset with this value |
noise_offset_random_strength | x | x | Use random strength for noise offset |
offload_optimizer_device | x | x | Device for offloading optimizer |
offload_optimizer_nvme_path | x | x | NVMe path for offloading optimizer |
offload_param_device | x | x | Device for offloading parameters |
offload_param_nvme_path | x | x | NVMe path for offloading parameters |
optimizer_args | x | x | Additional arguments for optimizer |
optimizer_type | x | x | Optimizer to use |
output_config | x | x | Output command line args to given .toml file |
output_dir | x | x | Directory to output trained model |
output_name | x | x | Base name of trained model file |
persistent_data_loader_workers | x | x | Persistent DataLoader workers |
pretrained_model_name_or_path | x | x | Pretrained model to train, directory to Diffusers model or StableDiffusion checkpoint |
random_crop | x | x | Enable random crop |
reg_data_dir | x | x | Directory for regularization images |
resolution | x | x | Resolution in training |
resume | x | x | Saved state to resume training |
resume_from_huggingface | x | x | Resume from huggingface |
sample_at_first | x | x | Generate sample images before training |
sample_every_n_epochs | x | x | Generate sample images every N epochs |
sample_every_n_steps | x | x | Generate sample images every N steps |
sample_prompts | x | x | File for prompts to generate sample images |
sample_sampler | x | x | Sampler (scheduler) type for sample images |
save_clip | x | | Save CLIP models to checkpoint |
save_every_n_epochs | x | x | Save checkpoint every N epochs |
save_every_n_steps | x | x | Save checkpoint every N steps |
save_last_n_epochs | x | x | Save last N checkpoints when saving every N epochs |
save_last_n_epochs_state | x | x | Save last N checkpoints of state |
save_last_n_steps | x | x | Save checkpoints until N steps elapsed |
save_last_n_steps_state | x | x | Save states until N steps elapsed |
save_model_as | x | x | Format to save the model |
save_n_epoch_ratio | x | x | Save checkpoint N epoch ratio |
save_precision | x | x | Precision in saving |
save_state | x | x | Save training state additionally when saving model |
save_state_on_train_end | x | x | Save training state on train end |
save_state_to_huggingface | x | x | Save state to huggingface |
save_t5xxl | x | | Save T5-XXL model to checkpoint |
scale_v_pred_loss_like_noise_pred | x | x | Scale v-prediction loss like noise prediction loss |
sdpa | x | x | Use sdpa for CrossAttention |
secondary_separator | x | x | Secondary separator for caption |
seed | x | x | Random seed for training |
shuffle_caption | x | x | Shuffle separated caption |
t5xxl | x | | T5-XXL model path |
t5xxl_device | x | | T5-XXL device |
t5xxl_dtype | x | | T5-XXL dtype |
text_encoder_batch_size | x | | Text encoder batch size |
token_warmup_min | x | x | Start learning at N tags |
token_warmup_step | x | x | Tag length reaches maximum on N steps |
tokenizer_cache_dir | x | x | Directory for caching Tokenizer |
torch_compile | x | x | Use torch.compile |
train_batch_size | x | x | Batch size for training |
train_data_dir | x | x | Directory for train images |
train_text_encoder | | x | Train text encoder |
use_8bit_adam | x | x | Use 8bit AdamW optimizer |
use_lion_optimizer | x | x | Use Lion optimizer |
use_safetensors | x | x | Use safetensors format to save |
v2 | x | x | Load Stable Diffusion v2.0 model |
v_parameterization | x | x | Enable v-parameterization training |
v_pred_like_loss | x | x | Add v-prediction like loss multiplied by this value |
vae | x | x | Path to checkpoint of VAE to replace |
vae_batch_size | x | x | Batch size for caching latents |
wandb_api_key | x | x | Specify WandB API key |
wandb_run_name | x | x | The name of the specific wandb session |
weighted_captions | x | x | Enable weighted captions |
weighting_scheme | x | | Options for weighting scheme |
xformers | x | x | Use xformers for CrossAttention |
zero3_init_flag | x | x | Flag to indicate whether to enable deepspeed.zero.Init |
zero3_save_16bit_model | x | x | Flag to indicate whether to save 16-bit model |
zero_stage | x | x | ZeRO stage for DeepSpeed |
zero_terminal_snr | x | x | Fix noise scheduler betas to enforce zero terminal SNR |
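For anyone who wants to poke at the script directly before the GUI is ready, here is a rough, untested sketch of what a bare-bones sd3_train.py run could look like. It is built only from the option names in the table above; every path and value is a placeholder, and the SD3-specific weighting options (weighting_scheme, logit_mean, logit_std, mode_scale) are left out because I have not checked their accepted values.

```bash
# Illustrative only: option names are from the table above, all paths/values are placeholders.
accelerate launch sd3_train.py \
  --pretrained_model_name_or_path "/path/to/sd3_medium.safetensors" \
  --clip_l "/path/to/clip_l.safetensors" \
  --clip_g "/path/to/clip_g.safetensors" \
  --t5xxl "/path/to/t5xxl.safetensors" \
  --train_data_dir "/path/to/images" \
  --resolution "1024,1024" \
  --train_batch_size 1 \
  --learning_rate 1e-5 \
  --optimizer_type AdamW8bit \
  --mixed_precision bf16 \
  --cache_latents \
  --cache_text_encoder_outputs \
  --gradient_checkpointing \
  --sdpa \
  --max_train_epochs 10 \
  --save_every_n_epochs 1 \
  --save_model_as safetensors \
  --output_dir "/path/to/output" \
  --output_name sd3-finetune-test
```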
Some info about the differences so far between sd3_train and sdxl_train:

sd3_train.py and sdxl_train.py are similar in structure and purpose, as they are both training scripts for large diffusion models. However, there are several key differences between them:

1. Model Architecture:
   * sd3_train.py is designed for training SD3 (Stable Diffusion 3) models.
   * sdxl_train.py is for training SDXL (Stable Diffusion XL) models.
2. Text Encoders:
   * SD3 uses CLIP-L, CLIP-G, and T5-XXL as text encoders.
   * SDXL uses two CLIP text encoders (referred to as text_encoder1 and text_encoder2).
3. UNet/Diffusion Model:
   * SD3 uses an MMDiT (Multimodal Diffusion Transformer) model.
   * SDXL uses a modified UNet architecture.
4. Training Process:
   * SD3 uses a flow matching approach for training.
   * SDXL uses a more traditional diffusion process.
5. Scheduler:
   * SD3 uses a FlowMatchEulerDiscreteScheduler.
   * SDXL uses a DDPMScheduler.
6. Conditioning:
   * SD3 has a different conditioning process, incorporating outputs from CLIP-L, CLIP-G, and optionally T5-XXL.
   * SDXL uses a combination of two CLIP encoder outputs and additional embeddings for size and crop information.
7. Loss Calculation:
   * SD3 uses a flow matching loss.
   * SDXL uses a more standard diffusion loss.
8. Argument Parsing:
   * There are differences in the command-line arguments accepted by each script, reflecting the different model architectures and training processes.
9. Model Loading and Saving:
   * The scripts have different functions for loading and saving model checkpoints, reflecting the different model architectures.
While both scripts share a common overall structure (data loading, model setup, training loop, etc.), the specific implementation details differ significantly due to the architectural differences between SD3 and SDXL. The SD3 training process appears to be more complex, incorporating multiple text encoders and a flow matching approach, while the SDXL training process is more similar to previous Stable Diffusion versions.
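To make the "flow matching vs. diffusion loss" point concrete, here is a rough sketch of the objective as described in the SD3 paper and in typical rectified-flow implementations (not verified line-by-line against sd3_train.py): the model is trained to predict the velocity between clean latents and noise along a straight path, and the logit_mean / logit_std options correspond to a logit-normal distribution over the timestep.

$$
x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad
\mathcal{L}_{\mathrm{SD3}} = \mathbb{E}\,\bigl\|\,v_\theta(x_t, t, c) - (\epsilon - x_0)\,\bigr\|^2, \qquad
t = \sigma(u),\; u \sim \mathcal{N}(m, s^2)
$$

Here $x_0$ are the clean latents, $\epsilon$ is Gaussian noise, $c$ is the conditioning built from CLIP-L/CLIP-G/T5-XXL, $\sigma$ is the sigmoid, and $m$, $s$ map to logit_mean and logit_std. SDXL instead keeps the familiar epsilon-prediction objective $\mathbb{E}\|\epsilon_\theta(x_t, t, c) - \epsilon\|^2$ on a DDPM noise schedule.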
Any updates?
Waiting for sd-scripts to complete the work on the sd3 branch and merge it to dev. Once it is merged into the sd-scripts dev branch and things are stable, I will work on the GUI integration.
It's actually pretty complete, though I'm having a bit of trouble running that code. People have been able to train on it. Well, take your time.
I have received good parameters to test training with SD3. I will begin the work on the GUI. WIP will be available in the sd3 branch for those who are curious and might want to test it out.
OK, I think there is now an MVP (Minimum Viable Product) for SD3 training. Here is how to use it:
- Run `git pull ; git checkout sd3`
- Select the SD3 medium safetensors model as the `Pretrained model name or path`
- Select the SD3 checkbox
- Under Parameters, in the Basic accordion, you will find the SD3 accordion for the SD3-specific parameters.
That is it. Everything else is up to you to configure for the training. Let me know how it goes and share back presets that I can add to the GUI for others to start quickly with finetuning SD3 or creating a Dreambooth (essentially the same as Finetuning).
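If it helps, the full update-and-relaunch sequence on an existing kohya_ss install would look roughly like this (assuming the usual gui.sh / gui.bat launcher; adjust for your own setup):

```bash
cd /path/to/kohya_ss   # your existing kohya_ss checkout
git fetch
git checkout sd3       # switch to the work-in-progress SD3 branch
git pull               # pull the latest sd3 commits
./gui.sh               # relaunch the GUI (use gui.bat on Windows)
```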
Nice!!! Thanks. For the SD3 model, is it the sd3_medium_incl_clips_t5xxlfp8.safetensors one?
Can we use the merged CLIP model, or the sd3_medium (4 GB) model and the other CLIP models separately?
It's throwing a few errors. The first one has to do with caching the text encoder, which I was able to get past, and then it throws the error above regarding the tokenizer.
It is quite possible kohya might have to fix a few things on his end. Today I will try to do an actual training run. Yesterday was just about building the GUI as a minimum product to start playing with and discovering potential issues.