bmaltais / kohya_ss


49 sec/it on 4090 #2520


timmbobb commented 4 months ago

I'm getting EXTREMELY slow training speed.

49 sec/it on a 4090 seems completely unreasonable.

Here's my config:

14:16:06-593487 WARNING Here is the trainer command as a reference. It will not be executed:

14:16:06-594489 INFO C:\StableDiffusion\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --mixed_precision fp16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 C:/StableDiffusion/kohya_ss/sd-scripts/sdxl_train_network.py --config_file C:/StableDiffusion/kohya_ss/dataset/formatted_training_images\model/config_lora-20240521-141606 .toml

14:16:06-595487 INFO Showing toml config file: C:/StableDiffusion/kohya_ss/dataset/formatted_training_images\model/config_lora-20240521-141606 .toml

14:16:06-596487 INFO
bucket_no_upscale = true
bucket_reso_steps = 64
caption_extension = ".txt"
clip_skip = 1
dynamo_backend = "no"
enable_bucket = true
epoch = 10
gradient_accumulation_steps = 1
huber_c = 0.1
huber_schedule = "snr"
learning_rate = 0.0001
logging_dir = "C:/StableDiffusion/kohya_ss/dataset/formatted_training_images\log"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 3
lr_scheduler_power = 1
lr_warmup_steps = 388
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 7770
min_bucket_reso = 256
mixed_precision = "fp16"
multires_noise_discount = 0.3
network_alpha = 16
network_args = []
network_dim = 32
network_module = "networks.lora"
no_half_vae = true
noise_offset_type = "Original"
optimizer_args = [ "weight_decay=0.1", "betas=[0.9,0.99]",]
optimizer_type = "AdamW8bit"
output_dir = "C:/StableDiffusion/kohya_ss/dataset/formatted_training_images\model"
output_name = "last"
pretrained_model_name_or_path = "stabilityai/stable-diffusion-xl-base-1.0"
prior_loss_weight = 1
resolution = "1024,1024"
sample_every_n_epochs = 1
sample_prompts = "C:/StableDiffusion/kohya_ss/dataset/formatted_training_images\model\prompt.txt"
sample_sampler = "dpmsolver++"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "fp16"
shuffle_caption = true
text_encoder_lr = 2e-5
train_batch_size = 3
train_data_dir = "C:/StableDiffusion/kohya_ss/dataset/formatted_training_images\img"
unet_lr = 0.0001
xformers = true

14:16:06-599487 INFO end of toml config file:

[screenshot of the training run attached]

timmbobb commented 4 months ago

I feel like I must be doing something mind-blowingly wrong or stupid, because people are saying that 2 s/it is slow, and mine is running 25 times slower than that. Any help would be appreciated!

b-fission commented 4 months ago

It could be related to high VRAM usage.

  1. Do you have Gradient Checkpointing turned on? It greatly reduces VRAM usage, but there is a tradeoff in speed.
  2. Try a different optimizer like Adafactor or Lion8bit. Those have lower VRAM usage than AdamW and others.
  3. When VRAM usage goes over the GPU's capacity, NVIDIA's driver tries to use system RAM as extra VRAM, which slows things down a lot. You could try disabling sysmem fallback in the NVIDIA Control Panel. (A config sketch for points 1 and 2 follows at the end of this comment.)

Anecdotally, my RTX 4090 gets around 1.34 s/it when training with settings similar to yours.
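
For reference, a minimal sketch of what points 1 and 2 could look like in the same TOML config shown above. The option names (`gradient_checkpointing`, `optimizer_type`, `optimizer_args`) are assumed from the kohya_ss sd-scripts config format, and the Adafactor arguments are a commonly used combination rather than values taken from this thread:

```toml
# Point 1: enable gradient checkpointing - trades some speed for much lower VRAM usage
gradient_checkpointing = true

# Point 2: swap to a lower-VRAM optimizer (Adafactor shown; Lion8bit is another option)
optimizer_type = "Adafactor"
# Typical Adafactor arguments for a fixed learning rate (assumed; adjust to taste)
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False",]
```

Point 3 (sysmem fallback) is a driver setting changed in the NVIDIA Control Panel, not in this config file.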

maybleMyers commented 4 months ago

Your batch size is too high. As the previous comment states, you are using too much VRAM. You have to keep training under 24 GB or it slows way down. For full model fine-tunes you need a batch size of 1 and caching of everything to fit inside a 24 GB card.
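
In the same spirit, a hedged sketch of lowering the batch size and caching latents for the LoRA config above (option names assumed from the sd-scripts TOML format; caching latents disables augmentations such as random crop, and caching text encoder outputs would conflict with `shuffle_caption` and a non-zero `text_encoder_lr`, so it is left out here):

```toml
# Smaller batch to stay within the 4090's 24 GB of VRAM
train_batch_size = 1

# Pre-encode image latents once up front so the VAE doesn't occupy VRAM during training
cache_latents = true
cache_latents_to_disk = true
```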