bmaltais / kohya_ss

73 s/it for SDXL LoRA training with RTX 3080, is it normal? #1524

Closed Entretoize closed 8 months ago

Entretoize commented 1 year ago

I followed a tutorial to train a LoRA model with Kohya for SDXL. The best I can get is 73 s/it, which seems slow, but maybe it's normal? I had already followed another tutorial for SD1.5 and training was fast, so I think something is wrong.

Here's my cmd history:

11:50:51-006022 INFO     Start training LoRA Standard ...
11:50:51-008022 INFO     Checking for duplicate image filenames in training data directory...
11:50:51-011023 INFO     Valid image folder names found in:
                         E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\img
11:50:51-013022 INFO     Valid image folder names found in:
                         E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\reg
11:50:51-016022 INFO     Folder 20_MyModelName person: 39 images found
11:50:51-018022 INFO     Folder 20_MyModelName person: 780 steps
11:50:51-020022 WARNING  Regularisation images are used... Will double the number of steps required...
11:50:51-022023 INFO     Total steps: 780
11:50:51-023023 INFO     Train batch size: 1
11:50:51-025023 INFO     Gradient accumulation steps: 1
11:50:51-026022 INFO     Epoch: 8
11:50:51-028022 INFO     Regulatization factor: 2
11:50:51-029022 INFO     max_train_steps (780 / 1 / 1 * 8 * 2) = 12480
11:50:51-031022 INFO     stop_text_encoder_training = 0
11:50:51-032023 INFO     lr_warmup_steps = 0
11:50:51-035023 INFO     Saving training config to
                         E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\model\MyModelName_20230918-115
                         051.json...
11:50:51-039026 INFO     accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
                         --pretrained_model_name_or_path="E:/MyData/Logiciels/stable-diffusion-webui-1.6.0/SD
                         Gui/models/Stable-diffusion/sd_xl_base_1.0.safetensors"
                         --train_data_dir="E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\img"
                         --reg_data_dir="E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\reg"
                         --resolution="1024,1024"
                         --output_dir="E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\model"
                         --logging_dir="E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\log"
                         --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora
                         --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=19 --output_name="MyModelName"
                         --lr_scheduler_num_cycles="8" --no_half_vae --full_bf16 --learning_rate="0.0004"
                         --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="12480"
                         --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents
                         --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False
                         relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64
                         --save_state --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0
                         --network_train_unet_only
prepare tokenizers
Using DreamBooth method.
prepare images.
found directory E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\img\20_MyModelName person contains 39 image files
No caption file found for 39 images. Training will continue without captions for these images. If class token exists, it will be used. / 39枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を 続行します。class tokenが存在する場合はそれを使います。
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\img\20_MyModelName person\image-134.JPG
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\img\20_MyModelName person\image-135.JPG
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\img\20_MyModelName person\image-136.JPG
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\img\20_MyModelName person\image-137.JPG
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\img\20_MyModelName person\image-138.JPG
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\img\20_MyModelName person\image-151.JPG... and 34 more
found directory E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\reg\1_person contains 279 image files
No caption file found for 279 images. Training will continue without captions for these images. If class token exists, it will be used. / 279枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習 を続行します。class tokenが存在する場合はそれを使います。
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\reg\1_person\person_ddim_00001_.png
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\reg\1_person\person_ddim_00002_.png
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\reg\1_person\person_ddim_00008_.png
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\reg\1_person\person_ddim_00012_.png
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\reg\1_person\person_ddim_00013_.png
E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\reg\1_person\person_ddim_00015_.png... and 274 more
780 train images with repeating.
279 reg images.
[Dataset 0]
  batch_size: 1
  resolution: (1024, 1024)
  enable_bucket: False

  [Subset 0 of Dataset 0]
    image_dir: "E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\img\20_MyModelName person"
    image_count: 39
    num_repeats: 20
    shuffle_caption: False
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: MyModelName person
    caption_extension: .caption

  [Subset 1 of Dataset 0]
    image_dir: "E:\MyData\Logiciels\stable-diffusion-webui-1.6.0\TrainingOut\reg\1_person"
    image_count: 279
    num_repeats: 1
    shuffle_caption: False
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: True
    class_tokens: person
    caption_extension: .caption

[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████| 318/318 [00:00<00:00, 3087.44it/s]
prepare dataset
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: E:/MyData/Logiciels/stable-diffusion-webui-1.6.0/SD Gui/models/Stable-diffusion/sd_xl_base_1.0.safetensors
building U-Net
loading U-Net from checkpoint
U-Net:  <All keys matched successfully>
building text encoders
loading text encoders from checkpoint
text encoder 1: <All keys matched successfully>
text encoder 2: <All keys matched successfully>
building VAE
loading VAE from checkpoint
VAE: <All keys matched successfully>
Enable xformers for U-Net
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|████████████████████████████████████████████████████████████████████████████████| 318/318 [00:03<00:00, 86.78it/s]
caching latents...
0it [00:00, ?it/s]
create LoRA network. base dim (rank): 19, alpha: 1.0
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder 1:
create LoRA for Text Encoder 2:
create LoRA for Text Encoder: 264 modules.
create LoRA for U-Net: 722 modules.
enable LoRA for U-Net
prepare optimizer, data loader etc.
use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False}
because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません
constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません
enable full bf16 training.
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 780
  num reg images / 正則化画像の数: 279
  num batches per epoch / 1epochのバッチ数: 1560
  num epochs / epoch数: 8
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 12480
steps:   0%|                                                                                 | 0/12480 [00:00<?, ?it/s]
epoch 1/8
steps:   0%|▏                                                     | 50/12480 [1:01:55<256:36:01, 74.32s/it, loss=0.123]
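
As a rough sanity check on the log above, the step count and the projected run time can be reproduced by hand. This is only a sketch that mirrors the formula printed in the log (`780 / 1 / 1 * 8 * 2`); the function names are made up for illustration and are not part of kohya_ss.

```python
def max_train_steps(images, repeats, batch_size, grad_accum, epochs, reg_factor):
    """Mirror the formula printed in the log: steps / batch / accum * epochs * reg factor."""
    steps_per_epoch = images * repeats  # 39 images * 20 repeats = 780
    return int(steps_per_epoch / batch_size / grad_accum * epochs * reg_factor)


def eta_hours(total_steps, seconds_per_it):
    """Projected wall-clock time for the given number of steps, in hours."""
    return total_steps * seconds_per_it / 3600


total = max_train_steps(images=39, repeats=20, batch_size=1,
                        grad_accum=1, epochs=8, reg_factor=2)
print(total)                              # 12480
print(round(eta_hours(total, 74.32), 1))  # 257.6
```

At 74.32 s/it the full 12480 steps would take roughly 257 hours, which is consistent with the `256:36:01` remaining shown by the progress bar after the first 50 steps.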

I have an RTX 3080 GPU (10 GB dedicated + 16 GB shared). While training, it uses all of the dedicated memory plus 2.5 GB of the shared memory, but in my other tests with SD1.5 that didn't seem to be a problem.

From what I've read, the speed should be 1 or 2 s/it. I also read about someone who reinstalled some package and solved this problem, but which package should I reinstall? Others said they had this kind of issue with the latest version. I'm confused...

2blackbar commented 1 year ago

Of course it's not. Tons of people have broken training and slow speeds like this, and nobody has an answer. It only happens with SDXL, which is supposed to be easy to train.

sangoi-exe commented 1 year ago

Not a bug. Your parameters need more VRAM than your GPU has, so the training script spills over into regular system RAM as well, which makes training incredibly slow. Review your parameters, optimizer, etc.
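
One way to watch for this while training is to poll `nvidia-smi` for dedicated memory usage. The sketch below assumes `nvidia-smi` is on the PATH; the helper names and the sample reading are hypothetical. Dedicated memory sitting near its limit is only a hint that the driver may be spilling into shared system memory, not proof of it.

```python
import subprocess

# Real nvidia-smi query: used and total dedicated memory, in MiB, without header or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_line):
    """Parse one 'used, total' line of nvidia-smi CSV output into integers (MiB)."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def headroom_mib(csv_line):
    """Free dedicated VRAM in MiB; a value close to zero suggests a risk of spillover."""
    used, total = parse_memory(csv_line)
    return total - used


def query_first_gpu():
    """Run nvidia-smi and return the CSV line for the first GPU (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return out.splitlines()[0]


sample = "9900, 10240"  # hypothetical reading for a 10 GB card, not taken from this thread
print(headroom_mib(sample))  # 340
```

On a machine with an NVIDIA GPU, `headroom_mib(query_first_gpu())` gives the live value.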

Entretoize commented 1 year ago

After applying all the optimizations listed on the kohya website, I'm able to keep VRAM usage below 9 GB and training runs at 7 s/it, so it seems you were right. Thanks!

JAssertz commented 1 year ago

@Entretoize go to https://www.nvidia.com/en-us/geforce/drivers, select your GPU, and downgrade to version 531. That will fix your issue. I had a newer version before and was getting around 6-12 s/it; now it's down to about ~1.2-.09 s/it.

Entretoize commented 1 year ago

I can't confirm: I'm still at 7 s/it with a 531 version. Maybe it depends on which RTX card you have...

StaffanJOlsson commented 10 months ago

JAssertz wrote: "@Entretoize go to https://www.nvidia.com/en-us/geforce/drivers select your gpu and downgrade to version 531 and that will fix your issue I had a new ver b4 and had like 6-12 s/it now its down to like ~1.2-.09s/it"

This did it for me: I went from 12.4 s/it to 1.9 s/it. However, 531 is now only available under Studio drivers.

SchemingWeasel1 commented 10 months ago

Version 531 also worked for my RTX 3090. It went from 12 s/it to 1.3 s/it. Thanks for the tip!
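
A note on units, since both appear in this thread: s/it (seconds per iteration) is lower-is-better, while it/s is its reciprocal and higher-is-better. tqdm-style progress bars switch to s/it once the rate drops below one iteration per second. A small sketch for comparing reports, with helper names invented for illustration:

```python
def to_seconds_per_it(value, unit):
    """Normalize a progress-bar rate to seconds per iteration."""
    if unit == "s/it":
        return value
    if unit == "it/s":
        return 1.0 / value
    raise ValueError(f"unknown unit: {unit}")


def speedup(before_s_per_it, after_s_per_it):
    """How many times faster the run became (values above 1 mean faster)."""
    return before_s_per_it / after_s_per_it


# Figures reported earlier in this thread.
print(round(speedup(73, 7), 1))      # 10.4
print(round(speedup(12.4, 1.9), 1))  # 6.5
print(to_seconds_per_it(2.0, "it/s"))  # 0.5
```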