bmaltais / kohya_ss

Apache License 2.0

SDXL Lora training Extremely slow on Rtx 4090 #1288

Closed etha302 closed 11 months ago

etha302 commented 1 year ago

As the title says, training a LoRA for SDXL on a 4090 is painfully slow. It needs at least 15-20 seconds to complete a single step, so it is impossible to train. I don't know whether I am doing something wrong, but here are screenshots of my settings. It is also using the full 24 GB of VRAM, yet it is so slow that even the GPU fans are not spinning.

etha302 commented 1 year ago

It is a little faster with these optimizer args: scale_parameter=False relative_step=False warmup_init=False, but still slow, especially considering this is a 4090; I can't imagine how slow it would be on slower cards. So I think this has to be addressed, unless I am doing something terribly wrong.
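
For context, those flags correspond, as far as I can tell, to keyword arguments of the Adafactor implementation that sd-scripts pulls in from the transformers package: with relative_step=False and scale_parameter=False the optimizer stops deriving its own relative step size and uses the fixed learning rate you give it, which is likely why they help here. A minimal sketch of that construction with stand-in parameters (nothing below comes from the configs in this thread):

import torch
from transformers.optimization import Adafactor

# Stand-in parameter list; in sd-scripts this would be the LoRA network's trainable weights.
params = [torch.nn.Parameter(torch.zeros(4, 4))]

optimizer = Adafactor(
    params,
    lr=1e-4,                # an explicit LR is required once relative_step=False
    scale_parameter=False,  # don't rescale the step size by the parameter RMS
    relative_step=False,    # use the fixed lr above instead of a time-dependent schedule
    warmup_init=False,      # only meaningful when relative_step=True
)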

AIrtistry commented 1 year ago

Same problem for me with a 4080.

v0xie commented 1 year ago

Here's a configuration file that's been working well for me on a 4090. Outputs a ~5 MB LoRA that works with A1111 and ComfyUI.

The parameters to tweak are:

  • T_max : set to the total number of steps
  • train_batch_size: set according to your dataset size (1-4 works, 4+ works but is slow)
  • gradient_accumulation_steps: higher is usually faster, not sure what the upper bound is

What I like to do is start up Tensorboard and do a couple of test runs with different train_batch_size/gradient_accumulation_step combinations and see what settings train the fastest with the size of my dataset.
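
If it helps, the total step count the GUI reports (and therefore what you would plug into T_max) works out to roughly images × repeats ÷ (batch size × gradient accumulation) × epochs, as the training log later in this thread also shows. A quick back-of-the-envelope sketch in Python, with placeholder numbers rather than anyone's actual dataset:

# Rough step count for picking T_max; all numbers are placeholders.
images = 20        # images in the training folder
repeats = 1        # folder repeat count
epochs = 250
batch_size = 1
grad_accum = 4

steps_per_epoch = (images * repeats) // (batch_size * grad_accum)
total_steps = steps_per_epoch * epochs
print(total_steps)  # pass this as T_max in --lr_scheduler_args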

This Rentry article was used as reference: https://rentry.co/ProdiAgy

As far as I can tell, IA^3 training does not work for SDXL yet; anecdotally, IA^3 is much faster than training other LoRA types.

As a benchmark, I can train a LoKr LoRA with a dataset size of 20 for 250 epochs (1 repeat / epoch) in about 1.5 hours.

{  
  "LoRA_type": "LyCORIS/LoKr",
  "adaptive_noise_scale": 0,
  "additional_parameters": "--network_train_unet_only --lr_scheduler_type \"CosineAnnealingLR\" --lr_scheduler_args \"T_max=975\" \"eta_min=0.000\"",
  "block_alphas": "",
  "block_dims": "",
  "block_lr_zero_threshold": "",
  "bucket_no_upscale": true,
  "bucket_reso_steps": 32,
  "cache_latents": true,
  "cache_latents_to_disk": true,
  "caption_dropout_every_n_epochs": 0.0,
  "caption_dropout_rate": 0,
  "caption_extension": ".txt",
  "clip_skip": "1",
  "color_aug": false,
  "conv_alpha": 64,
  "conv_block_alphas": "",
  "conv_block_dims": "",
  "conv_dim": 64,
  "decompose_both": false,
  "dim_from_weights": false,
  "down_lr_weight": "",
  "enable_bucket": true,
  "epoch": 300,
  "factor": -1,
  "flip_aug": false,
  "full_bf16": true,
  "full_fp16": false,
  "gradient_accumulation_steps": 4.0,
  "gradient_checkpointing": true,
  "keep_tokens": 1,
  "learning_rate": 1.0,
  "logging_dir": "",
  "lora_network_weights": "",
  "lr_scheduler": "cosine",
  "lr_scheduler_num_cycles": "",
  "lr_scheduler_power": "",
  "lr_warmup": 0,
  "max_bucket_reso": 2048,
  "max_data_loader_n_workers": "0",
  "max_resolution": "1024,1024",
  "max_timestep": 1000,
  "max_token_length": "75",
  "max_train_epochs": "",
  "mem_eff_attn": false,
  "mid_lr_weight": "",
  "min_bucket_reso": 256,
  "min_snr_gamma": 3,
  "min_timestep": 0,
  "mixed_precision": "bf16",
  "model_list": "custom",
  "module_dropout": 0,
  "multires_noise_discount": 0.2,
  "multires_noise_iterations": 8,
  "network_alpha": 64,
  "network_dim": 64,
  "network_dropout": 0,
  "no_token_padding": false,
  "noise_offset": 0.0357,
  "noise_offset_type": "Original",
  "num_cpu_threads_per_process": 8,
  "optimizer": "Prodigy",
  "optimizer_args": "\"betas=0.9,0.999\" \"d0=1e-2\" \"d_coef=1.0\" \"weight_decay=0.400\" \"use_bias_correction=False\" \"safeguard_warmup=False\"",
  "output_dir": "",
  "output_name": "",
  "persistent_data_loader_workers": false,
  "pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0",
  "prior_loss_weight": 1.0,
  "random_crop": false,
  "rank_dropout": 0,
  "reg_data_dir": "",
  "resume": "",
  "sample_prompts": "",
  "sample_sampler": "euler_a",
  "save_every_n_epochs": 5,
  "save_every_n_steps": 0,
  "save_last_n_steps": 0,
  "save_last_n_steps_state": 0,
  "save_model_as": "safetensors",
  "save_precision": "bf16",
  "save_state": false,
  "scale_v_pred_loss_like_noise_pred": false,
  "scale_weight_norms": 1,
  "sdxl": true,
  "sdxl_cache_text_encoder_outputs": false,
  "sdxl_no_half_vae": true,
  "seed": "31337",
  "shuffle_caption": false,
  "stop_text_encoder_training": 0,
  "text_encoder_lr": 1.0,
  "train_batch_size": 1,
  "train_data_dir": "",
  "train_on_input": true,
  "training_comment": "",
  "unet_lr": 1.0,
  "unit": 1,
  "up_lr_weight": "",
  "use_cp": true,
  "use_wandb": false,
  "v2": false,
  "v_parameterization": false,
  "vae_batch_size": 0,
  "wandb_api_key": "",
  "weighted_captions": false,
  "xformers": true
}
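
For reference, the optimizer_args string in this config looks like keyword arguments of the Prodigy optimizer (the prodigyopt package), which is presumably what the "Prodigy" option maps to; with Prodigy, the learning_rate/unet_lr/text_encoder_lr of 1.0 act as multipliers on the step size it estimates for itself. A minimal sketch of the equivalent constructor call, with stand-in parameters only:

import torch
from prodigyopt import Prodigy

# Stand-in parameter list; in sd-scripts these would be the LoKr network's trainable weights.
params = [torch.nn.Parameter(torch.zeros(4, 4))]

optimizer = Prodigy(
    params,
    lr=1.0,                     # with Prodigy, lr=1.0 scales the adaptively estimated step size
    betas=(0.9, 0.999),
    d0=1e-2,                    # initial estimate of the step size
    d_coef=1.0,
    weight_decay=0.4,
    use_bias_correction=False,
    safeguard_warmup=False,
)
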
etha302 commented 1 year ago

Here's a configuration file that's been working well for me on a 4090. Outputs a ~5 MB LoRA that works with A1111 and ComfyUI.

The parameters to tweak are:

  • T_max : set to the total number of steps
  • train_batch_size: set according to your dataset size (1-4 works, 4+ works but is slow)
  • gradient_accumulation_steps: higher is usually faster, not sure what the upper bound is

What I like to do is start up Tensorboard and do a couple of test runs with different train_batch_size/gradient_accumulation_step combinations and see what settings train the fastest with the size of my dataset.

This Rentry article was used as reference: https://rentry.co/ProdiAgy

As far as I can tell, IA^3 training does not work for SDXL yet, and that is anecdotally much faster than training other LoRA.

As benchmark, I can train a LoKR LoRA with dataset size of 20 for 250 epochs (1 repeat / epoch) in about 1.5 hours.

Thanks for sharing the workflow, will definitely give it a try. I am only a little concerned about the network size, not sure if it's good for training realistic subjects. Also, 1.5 hours is still a lot for training a LoRA, and this has to get better over time; imagine people with a 3070, 3080, 4060 and so on, they will be training one LoRA all day. Above the 4090 you are entering enterprise-class hardware, and then we are talking 5000-10000 USD for one card, which I believe most people, including me, cannot afford. However, the LoRA with the settings I used above, although it took hours and hours to train, gave very decent results, especially for a first try. Edit: I am also looking forward to actually start finetuning the SDXL base model, once that --no_half_vae error is fixed.

bmaltais commented 1 year ago

SDXL training is also quite slow on my 3090... but not that slow. Not sure why it is this slow for you.

etha302 commented 1 year ago

SDXL training is also quite slow on my 3090... but not that slow. Not sure why it is this slow for you.

No idea, I tried everything. I trained another LoRA with 6000 steps and it took 13 hours to complete, about 7 seconds per step. The results aren't bad at all, but why it is so slow I just don't know. It also uses 24 GB of VRAM no matter what I change, and this starts with caching latents: first it goes fast, then after a few seconds, when VRAM usage reaches 24 GB, it gets super slow. It stays like that through the whole training.

FurkanGozukara commented 1 year ago

SDXL training is also quite slow on my 3090... but not that slow. Not sure why it is this slow for you.

1.4 seconds/it for me

rtx 3090 ti - batch size 1, gradient accumulation 1

on the rtx 3060 it is about 2.8 seconds/it

but on the rtx 3060 i train at network rank 32, while on the rtx 3090 ti i use network rank 256

MrPlatnum commented 1 year ago

I don't even get past the "caching latents" stage. My PC just completely freezes when the 24 GB are full. Also on a 4090.

etha302 commented 1 year ago

I don't even get past the "caching latents" Stage. My PC just completely freezes when the 24 GB are full. Also on an 4090

Interesting, so it is even worse for you. This is why I didn't even attempt to use reg images yet, because this is probably what would happen. What I tried yesterday: reinstalling xformers, CUDA, new cuDNN .dlls, different nvidia drivers - nothing. Maybe something is really wrong with the script, but I don't think so, because not everyone has the same problem. What I will try today: reinstalling windows and installing everything from scratch, and I will report back if that solves the problem. If you have Discord, add me: davidk35, and I will send my .json files for you to try.

FurkanGozukara commented 1 year ago

I did a full test today

RTX 3060 is 2.4 seconds/it: https://twitter.com/GozukaraFurkan/status/1686296023751094273

RTX 3090 TI is 1.23 seconds/it: https://twitter.com/GozukaraFurkan/status/1686305740401541121

etha302 commented 1 year ago

I did a full test today

RTX 3060 is 2.4 second / it : https://twitter.com/GozukaraFurkan/status/1686296023751094273

RTX 3090 TI is 1.23 second / it : https://twitter.com/GozukaraFurkan/status/1686305740401541121

So after reinstalling windows and kohya, the speed on the 4090 is exactly the same, around 1.9-2s/it. So much slower than a 3090 and barely any faster than a 3060.

etha302 commented 1 year ago

It is clearly a 4090-specific problem at this point; I was talking to many people and most have the same problem. Different drivers don't help, and neither do new cuDNN .dll files. Choosing the AdamW 8-bit optimizer uses only 12 GB of VRAM and the speed is around 1.3s/it, so still very slow for a 4090, and the results probably aren't going to be good for SDXL. At this point, I really don't know what else to do.
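
(For anyone curious what the AdamW 8-bit option maps to: if I'm not mistaken, sd-scripts uses the bitsandbytes implementation, which keeps the optimizer state in 8-bit and is where most of the VRAM saving over regular AdamW comes from. A rough sketch with stand-in parameters, not anyone's actual training code:)

import torch
import bitsandbytes as bnb

# Stand-in parameter list; in sd-scripts this would be the LoRA network's trainable weights.
params = [torch.nn.Parameter(torch.zeros(4, 4))]

# 8-bit optimizer state roughly halves AdamW's memory overhead per parameter.
optimizer = bnb.optim.AdamW8bit(params, lr=1e-4, weight_decay=0.01)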

wen020 commented 1 year ago

Does the training picture have to be 1024x1024?

bmaltais commented 1 year ago

Does the training picture have to be 1024x1024?

They don't have to be square... but the total number of pixels in the image should be equal to or greater than 1024 x 1024.
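
If you want to sanity-check a dataset against that rule before training, a small throwaway script along these lines will flag images that fall short (the folder name and extension are placeholders):

from pathlib import Path
from PIL import Image

MIN_PIXELS = 1024 * 1024  # images don't need to be square, just at least this many pixels in total

for path in Path("train_images").glob("*.png"):  # hypothetical folder and extension
    with Image.open(path) as im:
        if im.width * im.height < MIN_PIXELS:
            print(f"{path.name}: {im.width}x{im.height} is below 1024x1024 worth of pixels")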

Thom293 commented 1 year ago

Here's a configuration file that's been working well for me on a 4090. Outputs a ~5 MB LoRA that works with A1111 and ComfyUI.

The parameters to tweak are:

* `T_max` : set to the total number of steps

* `train_batch_size`: set according to your dataset size (1-4 works, 4+ works but is slow)

* `gradient_accumulation_steps`: higher is usually faster, not sure what the upper bound is

What I like to do is start up Tensorboard and do a couple of test runs with different train_batch_size/gradient_accumulation_step combinations and see what settings train the fastest with the size of my dataset.

This Rentry article was used as reference: https://rentry.co/ProdiAgy

As far as I can tell, IA^3 training does not work for SDXL yet, and that is anecdotally much faster than training other LoRA.

As benchmark, I can train a LoKR LoRA with dataset size of 20 for 250 epochs (1 repeat / epoch) in about 1.5 hours.

I used this method and it worked; I trained my first LoRA. Thank you very much. Running a second one now. However, on a 4090 it is extremely slow.

20 images, 300 epochs, 10 repeats has taken most of a day. I'd love some advice on how to make it quicker. This seems really slow. I haven't trained a LoRA before, but training a TI I could do 50+ images on a mobile 3080 16GB in a few hours.

I'm around 250 and this is what the speed shows:

image

etha302 commented 1 year ago

Here's a configuration file that's been working well for me on a 4090. Outputs a ~5 MB LoRA that works with A1111 and ComfyUI. The parameters to tweak are:

* `T_max` : set to the total number of steps

* `train_batch_size`: set according to your dataset size (1-4 works, 4+ works but is slow)

* `gradient_accumulation_steps`: higher is usually faster, not sure what the upper bound is

What I like to do is start up Tensorboard and do a couple of test runs with different train_batch_size/gradient_accumulation_step combinations and see what settings train the fastest with the size of my dataset. This Rentry article was used as reference: https://rentry.co/ProdiAgy As far as I can tell, IA^3 training does not work for SDXL yet, and that is anecdotally much faster than training other LoRA. As benchmark, I can train a LoKR LoRA with dataset size of 20 for 250 epochs (1 repeat / epoch) in about 1.5 hours.

I used this method and it worked. I trained my first LORA. Thank you very much. Running a second one now. However, on a 4090 it is extremely slow.

20 images, 300 epocs, 10 repeats has taken most of a day. Id love some advice on how to make it quicker. This seems really slow. I havent trained a lora before but traning a TI I could do 50+ images on a mobiel 3080 16gb in a few hours.

Im around 250 and this is what the speed shows:

image

It's even slower for you than for me; I get around 2s/it, which is still ridiculously slow. Everything points to nvidia drivers at this point. Testing on linux today, will report back later.

MiloMindbender commented 1 year ago

I am having a similar issue: on a 4090, SDXL LoRA training is going at about 1.82s/it and using all 24 GB of VRAM even with a batch size of 1. This is the first SDXL training I have tried, on a new computer with a 4090 I have not used for training before, so I'm wondering if this speed and VRAM use is normal for a 4090?

settingsV2.txt

brianiup commented 1 year ago

I'm seeing this also with my 4090: all 24 GB of VRAM gets used up, CUDA usage is at 99%, and it's training at 10.2s/it at 1024x1024 resolution.

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="E:/Models/Stable-Diffusion/Checkpoints/SDXL1.0/sd_xl_base_1.0.safetensors" --train_data_dir="E:/Models/Stable-Diffusion/Training/Lora/test\img" --resolution="1024,1024" --output_dir="E:/Models/Stable-Diffusion/Training/Lora/test\model" --logging_dir="E:/Models/Stable-Diffusion/Training/Lora/test\log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0001 --unet_lr=0.0001 --network_dim=256 --output_name="test" --lr_scheduler_num_cycles="5" --no_half_vae --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5800" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --seed="12345" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="4" --bucket_reso_steps=64 --save_state --xformers --bucket_no_upscale --noise_offset=0.0357 --sample_sampler=euler_a --sample_prompts="E:/Models/Stable-Diffusion/Training/Lora/test\model\sample\prompt.txt" --sample_every_n_steps="100"

Version: v21.8.5 Torch 2.0.1+cu118 Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24564 Arch (8, 9) Cores 128

Even when it generates sample prompt images, it's way slower than in automatic1111.

height: 1024 width: 1024 sample_steps: 40 scale: 8.0

28%|██████████████████████▌ | 11/40 [00:18<00:49, 1.70s/it]

In A1111 I get 10 it/s at that resolution with euler_a.

FurkanGozukara commented 1 year ago

I'm seeing this also with my 4090, all 24GB of VRAM get used up CUDA usage is at 99% and it's training at 10.2s/it for a 1024x1024 res

Wow, this is terrible.

If you are my Patreon supporter I would like to connect to your PC and try to help.

6b6a72 commented 1 year ago

At work so I can't grab any specifics right now - one thing to check is shared GPU memory usage.

I haven't had tons of time to experiment yet but, in my case, using regularization images from unsplash that I changed to have a max length/height of 2048 pushed the GPU to use shared VRAM during latent caching. Once shared GPU memory is in use, performance will suffer greatly. Latent caching is speedy for maybe 80-100 steps, then it's like the horrible times y'all are sharing (e.g., 10 s/it).

Changing to 1024x1024 images I generated myself from SDXL gets latent caching to a decent 5-8 it/s, and training with whatever my settings are was yielding about 1.2 it/s during training of a standard LoRA.

My numbers might be off a little bit - I can test more later - but check shared GPU memory and make sure it's not being used. Open Task Manager and check the Performance tab.
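
Task Manager is the reliable way to see shared-memory spill on Windows, but for a quick in-process view of how close you are to the dedicated VRAM ceiling, something like the following can be dropped into a script (just a sketch using PyTorch's CUDA query; it assumes a CUDA-capable GPU is visible):

import torch

# Free vs. total dedicated VRAM on GPU 0; once "free" approaches zero, Windows starts
# spilling into shared memory and step times fall off a cliff.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"free:  {free_bytes / 2**30:.1f} GiB")
print(f"total: {total_bytes / 2**30:.1f} GiB")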

brianiup commented 1 year ago

I turned bucketing off and went to 1024x1024 fixed-size images, and now I am getting speeds of 1.22s/it. It seems to have something to do with buckets and the Adafactor optimizer for me; still trying different combos, but turning bucketing off is a drastic improvement.

FurkanGozukara commented 1 year ago

I turned bucketing off and went to 1024x1024 fixed size images and now I am getting speeds of 1.22s/it, it seems to have something to do with buckets and adafactor optimizer for me, still trying different combos, but bucketing off has a drastic improvement

Nice info.

I also saw someone getting errors due to a bug in the bucketing system.

6b6a72 commented 1 year ago

Leaving buckets enabled and unchecking Don't upscale bucket resolution in Advanced is providing some acceptable results.

I tested with 10 source images, all at least 1600x1600, and the original resolution Unsplash pics (smallest of which is 1155x1732) for regularization. No shared memory usage (though it was close) during caching, with 4-7 it/s or so. Training steps actually ran the fastest I've seen yet at a very steady 1.53-1.54 it/s.

Not a permanent solution but it's working and I haven't seen anything better yet so ... YMMV

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket
  --min_bucket_reso=256 --max_bucket_reso=1024
  --pretrained_model_name_or_path="D:/sd/models/sdxl_v1/sd_xl_base_1.0.safetensors"
  --train_data_dir="output/img" --reg_data_dir="D:/unsplash/woman"
  --resolution="1024,1024" --output_dir="output/model"
  --logging_dir="output/log" --network_alpha="48"
  --save_model_as=safetensors --network_module=networks.lora --unet_lr=0.0001
  --network_train_unet_only --network_dim=96 --output_name="txmo_xl"
  --lr_scheduler_num_cycles="3" --cache_text_encoder_outputs --no_half_vae --full_bf16
  --learning_rate="0.0001" --lr_scheduler="constant_with_warmup" --lr_warmup_steps="240"
  --train_batch_size="1" --max_train_steps="2400" --save_every_n_epochs="1"
  --mixed_precision="bf16" --save_precision="bf16" --cache_latents --optimizer_type="Adafactor"
  --optimizer_args scale_parameter=False relative_step=False warmup_init=False
  --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --noise_offset=0.0357
brianiup commented 1 year ago

At work so I can't grab any specifics right now - one thing to check is shared GPU memory usage.

I haven't had tons of time to experiment yet but, in my case, using regularization images from unsplash that I changed to have a max length/height of 2048 pushed the GPU to use shared VRAM during latent caching. Once shared GPU memory is in use, performance will suffer greatly. Latent caching is speedy for maybe 80-100 steps, then it's like the horrible times y'all are sharing (e.g., 10 s/it).

Changing to 1024x1024 images I generated myself from SDXL gets latent caching to a decent 5-8 it/s, and training with whatever my settings are was yielding about 1.2 it/s during training of a standard LoRA.

My numbers might be off a little bit - I can test more later - but check shared GPU memory and make it's not being used. Open Task Manager and check the Performance tab.

Yes, I've noticed that a few times; I've seen total VRAM usage get into the 50 GB range with my 24 GB 4090, and when that happens iterations slow to around 50 seconds/it.

Thom293 commented 1 year ago

So confusing. Now it won't use all of my VRAM; it stops at 14 GB. Anyone know a setting that would fix that? The other config I tried used all my VRAM and got 1.2 it/s, but it produced black images and didn't work.

image

image

What driver version are y'all using? I had read not to update to the latest one, but now I am not so sure.

brianiup commented 1 year ago

People who are having issues: are you on NVIDIA graphics driver 536.67 by any chance? I am, and it looks like NVIDIA has a known issue about this:

https://us.download.nvidia.com/Windows/536.67/536.67-win11-win10-release-notes.pdf

"This driver implements a fix for creative application stability issues seen during heavy memory usage. We’ve observed some situations where this fix has resulted in performance degradation when running Stable Diffusion and DaVinci Resolve. This will be addressed in an upcoming driver release. [4172676]"

I reverted my drivers to the 535.98 studio version and now I am seeing drastically better performance; even with buckets enabled I am getting 1.19 it/s at 1024x1024 resolution.

Thom293 commented 1 year ago

I am having a similar issue, on 4090 SDXL lora training it is going about 1.82s/it and using all 24gb of ram even with a batch size of 1. This is the first SDXL training I have tried, and a new computer with 4090 I have not used for training before so I'm wondering if this speed an RAM use is normal for 4090?

settingsV2.txt

thank you for sharing this. Do you get images from this method? I get only grey or black at 30 epochs. I will try 250.

MartinTremblay commented 1 year ago

Training has become unusable on my 4090. Not sure what is happening.

image
MartinTremblay commented 1 year ago

Every setting is the same, but with the 1.5 model.

image
blazelet commented 1 year ago

people who are having issues, are you on nVIDIA graphics driver 536.67 by any chance? I am and it looks like nVIDIA has a known issue about this

https://us.download.nvidia.com/Windows/536.67/536.67-win11-win10-release-notes.pdf

"This driver implements a fix for creative application stability issues seen during heavy memory usage. We’ve observed some situations where this fix has resulted in performance degradation when running Stable Diffusion and DaVinci Resolve. This will be addressed in an upcoming driver release. [4172676]"

I reverted my drivers to 535.98 studio version and now I am seeing drastically better performance even with buckets enabled I am getting 1.19it/s with 1024x1024 resolution

I am on the rtx 4070 and rolled my drivers back to 536.40 - still getting 7.1s/it at this point. It's crazy slow.

6b6a72 commented 1 year ago

Currently running GUI v21.8.6, the latest NVIDIA drivers (536.67), and with this config:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket
                         --min_bucket_reso=256 --max_bucket_reso=2048
                         --pretrained_model_name_or_path="D:/sd/models/sdxl_v1/sd_xl_base_1.0.safetensors"
                         --train_data_dir="D:/sd/training/zelda/output\img"
                         --reg_data_dir="D:/sd/training/class/sdxl/woman" --resolution="1024,1024"
                         --output_dir="D:/sd/training/zelda/output\model"
                         --logging_dir="D:/sd/training/zelda/output\log" --network_alpha="48"
                         --save_model_as=safetensors --network_module=networks.lora --unet_lr=0.0001
                         --network_train_unet_only --network_dim=96 --output_name="pzel_xl"
                         --lr_scheduler_num_cycles="1" --cache_text_encoder_outputs --no_half_vae --full_bf16
                         --learning_rate="0.0001" --lr_scheduler="adafactor" --train_batch_size="1"
                         --max_train_steps="960" --save_every_n_epochs="1" --mixed_precision="bf16"
                         --save_precision="bf16" --cache_latents --optimizer_type="Adafactor"
                         --max_data_loader_n_workers="0" --bucket_reso_steps=32 --xformers --bucket_no_upscale
                         --noise_offset=0.0357

Current results are:

For this testing I'm using the following:

End of training: steps: 100%|████████████████████████████████████████████████████████████| 960/960 [11:25<00:00, 1.40it/s, loss=0.0857]

MartinTremblay commented 1 year ago

I am on the rtx 4070 and rolled my drivers back to 536.40 - still getting 7.1s/it at this point. Its crazy slow.

Same here (on an rtx 4090). I tried both versions. Will try with a fresh install of Kohya.

MiloMindbender commented 1 year ago

I am having a similar issue, on 4090 SDXL lora training it is going about 1.82s/it and using all 24gb of ram even with a batch size of 1. This is the first SDXL training I have tried, and a new computer with 4090 I have not used for training before so I'm wondering if this speed an RAM use is normal for 4090? settingsV2.txt

thank you for sharing this. Do you get images from this method? I get only grey or black at 30 epochs. I will try 250.

Yes, I did get a usable LoRA out of this. The sample images printed during training were extremely bad and mottled-looking right up to the last one generated. When I used the LoRA in automatic1111 with SDXL 1.0 I got clean images, though they only sometimes resembled the person I trained on.

Thom293 commented 1 year ago

I am having a similar issue, on 4090 SDXL lora training it is going about 1.82s/it and using all 24gb of ram even with a batch size of 1. This is the first SDXL training I have tried, and a new computer with 4090 I have not used for training before so I'm wondering if this speed an RAM use is normal for 4090? settingsV2.txt

thank you for sharing this. Do you get images from this method? I get only grey or black at 30 epochs. I will try 250.

Yes, I did get a usable LORA out of this. The sample images printed during training were extremely bad and mottled looking right up to the last one it generated. When I used the LORA in automatic1111 with SDXL1.0 I got clean images. though they only resembled the person I trained sometimes.

Thank you! I'm going to try it again today. Mind if I ask how many epochs/repeats? The setting has 1 epoch with no maximum.

Thom293 commented 1 year ago

Well, I don't know what changed, but I updated today, ran a LoRA with 15 images and no reg images, 1 repeat, an LR of 1, at 300 epochs, and got a perfect LoRA. I was getting 2.1 it/s but it finished in 40 mins, which I consider acceptable. Probably don't need 300, but I don't have time to test lower values right now.

EDIT: 150-200 epochs are actually all you need. 300 is very stiff and hard to vary from the training data.

I'm still using a very old nvidia driver too. So if you haven't updated kohya, give that a try.

DarkAlchy commented 1 year ago

people who are having issues, are you on nVIDIA graphics driver 536.67 by any chance? I am and it looks like nVIDIA has a known issue about this

https://us.download.nvidia.com/Windows/536.67/536.67-win11-win10-release-notes.pdf

"This driver implements a fix for creative application stability issues seen during heavy memory usage. We’ve observed some situations where this fix has resulted in performance degradation when running Stable Diffusion and DaVinci Resolve. This will be addressed in an upcoming driver release. [4172676]"

I reverted my drivers to 535.98 studio version and now I am seeing drastically better performance even with buckets enabled I am getting 1.19it/s with 1024x1024 resolution

I am getting absolutely terrible speed on my brand new 4090. 3-5s/it is garbage, and it is because GPU utilization is 65% on Windows and 80% on Linux. I have tried everything, and you may not know this, but on Linux 535.xx is the latest available driver.

sepro commented 1 year ago

Getting somewhat disappointing speeds here as well on an RTX 4080, about 2.61 s/it. I noticed that when network_dim is set too high, e.g. to 256, it runs out of VRAM and things get really slow (20+ seconds per iteration and worse) as it starts using regular RAM. So I needed to limit it to 128, which keeps everything within the 16 GB of VRAM, but the speed is still considerably lower than what some report. I was hoping the speed would go up with a lower rank/dim, e.g. 64, but it is the same.

Currently on the latest NVIDIA driver with a fresh Kohya install.

FurkanGozukara commented 1 year ago

Getting somewhat disappointing speeds here as well on a 4080 rtx, about 2.61 s/it. I noticed that when the network_dim is set too high e.g. to 256 it runs out of VRAM and things get really slow (20+ seconds per iteration and worse) as it start using regular RAM. So I needed to limit this to 128, which keeps everything in the 16G VRAM, but the speed is still considerably lower than what some report. Was hoping the speed would go up with a lower rank/dim e.g. 64 but it is the same.

Currently on the latest NVIDIA driver with a fresh Kohya install.

Yes, this is an issue with the RTX 4xxx series.

I saw many people reporting it, and it's all very bad.

DarkAlchy commented 1 year ago

https://github.com/bmaltais/kohya_ss/issues/961#issuecomment-1674843353

FurkanGozukara commented 1 year ago

network_train_unet_only

You are using network_train_unet_only.

That is probably making the difference.

DarkAlchy commented 1 year ago

network_train_unet_only

you are network_train_unet_only

that probably is making difference

Nice try but no.

FurkanGozukara commented 1 year ago

network_train_unet_only

you are network_train_unet_only that probably is making difference

Nice try but no.

What is your driver version?

What speed do you get when you train 256 network rank and both text encoder and unet and 1024*1024?

DarkAlchy commented 1 year ago

network_train_unet_only

you are network_train_unet_only that probably is making difference

Nice try but no.

What is your driver version?

What speed do you get when you train 256 network rank and both text encoder and unet and 1024*1024?

I already gave all the answers needed in the other comment. It was a controlled test of A vs B with everything else being equal.

FurkanGozukara commented 1 year ago

network_train_unet_only

you are network_train_unet_only that probably is making difference

Nice try but no.

What is your driver version? What speed do you get when you train 256 network rank and both text encoder and unet and 1024*1024?

I already gave all the answer that is needed in the other comment. Controlled test of A vs B with all being equal otherwise.

Ah, you have Ubuntu.

DarkAlchy commented 1 year ago

network_train_unet_only

you are network_train_unet_only that probably is making difference

Nice try but no.

What is your driver version? What speed do you get when you train 256 network rank and both text encoder and unet and 1024*1024?

I already gave all the answer that is needed in the other comment. Controlled test of A vs B with all being equal otherwise.

ah you have ubuntu

Yes, but so does Colab, and a 4090 should not be slower than a T4 on Colab. When I removed Windows from the equation and used the same setup (sd-scripts) on both, I was faster, but nothing like I should be. This isn't a kohya_ss GUI issue; it is partly sd-scripts (as I showed in my experiments using the pre-XL version) and partly Nvidia (they did say the issues will be taken care of in future driver updates).

synystersocks commented 1 year ago

Hi, I had the same issue: Win 11, 12700K, 3060 Ti 8 GB, 32 GB DDR4, 2 TB M.2 (the drive seems helpful with data streaming; I suspect Resizable BAR and/or GPUDirect Storage, implementation currently unknown). I was getting 47s/it, and now I'm getting 3.19s/it after a few checks, repairs and installs. I'm using the latest nvidia GPU drivers, 536.99 from 08/08/23; not tested on older drivers.

  1. In NVIDIA GeForce Experience, enable experimental features = true.
  2. Enable developer mode on Windows 11 (just search for "developer"; no need to go into Visual Studio).
  3. Download and install the C++ redistributable 2015-2022 (Visual Studio 2015, 2017, 2019, and 2022), x64 version; if it is already installed, press repair. (I have the 2015-2019 x86 version installed as well; I don't think that is required, but it may be helpful info.)
  4. Restart after the install or repair.
  5. Install bitsandbytes if not already installed.
  6. Install triton if not already installed.
  7. If using nvidia, install cuDNN if not already installed.

In Kohya, the main parameters required:

  1. Cache text encoder outputs (this made the most difference after dev mode activation and the C++ redistributable repair).
  2. Add --network_train_unet_only to the additional parameters.
  3. bf16 for training, bf16 or fp16 for saving.
  4. Enable full bf16 training.
  5. Enable gradient checkpointing.
  6. Enable memory-efficient attention.
  7. Select xformers for cross attention.
  8. Enable "don't upscale bucket resolution".
  9. Optional: resize images to a max resolution of 1024 to reduce the data that has to be processed (e.g. 1024px x 1024px = 1,048,576 pixels per image, while 3840px x 2160px "4K" = 8,294,400 pixels). With 4K images, crop to the content, then resize so the largest dimension is 1024px (a rough sketch of this step follows below).

Hopefully this helps.
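
As an illustration of step 9, one way to batch-downscale a folder so the longest side is 1024 px; this is only a sketch, not part of Kohya itself, and the paths and extension are placeholders:

from pathlib import Path
from PIL import Image

SRC = Path("raw_images")      # hypothetical input folder
DST = Path("resized_images")  # hypothetical output folder
DST.mkdir(exist_ok=True)

for path in SRC.glob("*.jpg"):  # adjust the extension to match your dataset
    with Image.open(path) as im:
        scale = 1024 / max(im.width, im.height)
        if scale < 1:  # only downscale; never upscale
            im = im.resize((round(im.width * scale), round(im.height * scale)), Image.LANCZOS)
        im.save(DST / path.name)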

DarkAlchy commented 1 year ago

A 3090 still beats a 4090, and all of that is how I train now. Use Linux: I have a friend whose exact settings would not work in Windows under any circumstances (it was always OOM), while I went to Linux, changed the paths, and let it rip, using about 19.5 GB of the 4090.

FurkanGozukara commented 1 year ago

Yes, the RTX 3090 is performing better than the 4090.

I think this is because of the nvidia drivers.

underrealSKY commented 1 year ago

This is somewhat disappointing, although this is also my first training ever. I followed Kohya's guide, but 39.03s/it is horrifying. Is there something wrong with my training config?

14:07:12-559494 INFO Valid image folder names found in: H:/KOHYA/TrainingResults/Lora/Duda\img
14:07:12-560495 INFO Folder 20_Andrzej Duda man: 18 images found
14:07:12-573506 INFO Folder 20_Andrzej Duda man: 360 steps
14:07:12-574507 INFO Total steps: 360
14:07:12-575508 INFO Train batch size: 1
14:07:12-575508 INFO Gradient accumulation steps: 1
14:07:12-576509 INFO Epoch: 10
14:07:12-577509 INFO Regulatization factor: 1
14:07:12-577509 INFO max_train_steps (360 / 1 / 1 * 10 * 1) = 3600
14:07:12-578510 INFO stop_text_encoder_training = 0
14:07:12-579511 INFO lr_warmup_steps = 0
14:07:12-580512 INFO Saving training config to H:/KOHYA/TrainingResults/Lora/Duda\model\Andrzej_Duda_20230826-140712.json...
14:07:12-581513 INFO accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="H:/SD/webui/models/Stable-diffusion/sd_xl_base_1.0.safetensors" --train_data_dir="H:/KOHYA/TrainingResults/Lora/Duda\img" --resolution="1024,1024" --output_dir="H:/KOHYA/TrainingResults/Lora/Duda\model" --logging_dir="H:/KOHYA/TrainingResults/Lora/Duda\log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0003 --unet_lr=0.0003 --network_dim=256 --output_name="Andrzej_Duda" --lr_scheduler_num_cycles="10" --no_half_vae --learning_rate="0.0003" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="3600" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --bucket_no_upscale --noise_offset=0.0

PS. Don't laugh, I am training it on pictures of the president of Poland, but I was looking for something distinctive and available in high-quality images.

FurkanGozukara commented 1 year ago

@underrealSKY follow my guide.

People are able to get 1-1.5 s/it:

https://youtu.be/sBFGitIvD2A