bmaltais / kohya_ss

Apache License 2.0

EPOCH ERROR #2430

Open eapolo opened 2 months ago

eapolo commented 2 months ago

Even though I set 10 epochs for training, it looks like only one epoch is executed, and I do not know why.

The output during training:

22:06:56-253531 INFO Total steps: 3960
22:06:56-254528 INFO Train batch size: 1
22:06:56-255796 INFO Gradient accumulation steps: 1
22:06:56-258526 INFO Epoch: 10
22:06:56-263380 INFO Max train steps: 1600
22:06:56-265377 INFO stop_text_encoder_training = 0
22:06:56-268733 INFO lr_warmup_steps = 0
22:06:56-271754 INFO Saving training config to C:/EnriqueModels/kohya_ss/satelite_images/data_training/model\last_20240501-220656.json...
22:06:56-275753 INFO Executing command: "C:\EnriqueModels\kohya_ss\venv\Scripts\accelerate.EXE" launch --dynamo_backend no --dynamo_mode default --mixed_precision fp16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 "C:/EnriqueModels/kohya_ss/sd-scripts/train_network.py" --config_file "./outputs/config_lora-20240501-220656.toml" with shell=True
22:06:56-287763 INFO Command executed.
2024-05-01 22:07:09 WARNING A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton' init.py:55
2024-05-01 22:07:14 INFO Loading settings from ./outputs/config_lora-20240501-220656.toml... train_util.py:3744
INFO ./outputs/config_lora-20240501-220656 train_util.py:3763
2024-05-01 22:07:14 INFO prepare tokenizer train_util.py:4227
INFO update token length: 75 train_util.py:4244
INFO Using DreamBooth method. train_network.py:172
INFO prepare images. train_util.py:1572
INFO found directory C:\EnriqueModels\kohya_ss\satelite_images\data_training\img\40_satellite_images image_of contains 99 image files train_util.py:1519
2024-05-01 22:07:15 INFO 3960 train images with repeating. train_util.py:1613
INFO 0 reg images. train_util.py:1616
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1621
INFO [Dataset 0] config_util.py:565
    batch_size: 1
    resolution: (512, 512)
    enable_bucket: True
    network_multiplier: 1.0
    min_bucket_reso: 256
    max_bucket_reso: 2048
    bucket_reso_steps: 64
    bucket_no_upscale: True

                           [Subset 0 of Dataset 0]
                             image_dir:
                         "C:\EnriqueModels\kohya_ss\satelite_images\data_training\img\40_satellite_images image_of"
                             image_count: 99
                             num_repeats: 40
                             shuffle_caption: False
                             keep_tokens: 0
                             keep_tokens_separator:
                             secondary_separator: None
                             enable_wildcard: False
                             caption_dropout_rate: 0.0
                             caption_dropout_every_n_epoches: 0
                             caption_tag_dropout_rate: 0.0
                             caption_prefix: None
                             caption_suffix: None
                             color_aug: False
                             flip_aug: False
                             face_crop_aug_range: None
                             random_crop: False
                             token_warmup_min: 1,
                             token_warmup_step: 0,
                             is_reg: False
                             class_tokens: satellite_images image_of
                             caption_extension: .txt

INFO [Dataset 0] config_util.py:571
INFO loading image sizes. train_util.py:853
100%|████████████████████| 99/99 [00:00<00:00, 193.17it/s]

2024-05-01 22:07:16 INFO make buckets train_util.py:859
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます train_util.py:876
INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:905
INFO bucket 0: resolution (128, 128), count: 3960 train_util.py:910
INFO mean ar error (without repeats): 0.0 train_util.py:915
INFO preparing accelerator train_network.py:225
accelerator device: cuda
INFO loading model for process 0/1 train_util.py:4385
INFO load Diffusers pretrained models: runwayml/stable-diffusion-v1-5 train_util.py:4347
Loading pipeline components...: 100%|████████████████████| 5/5 [00:00<00:00, 8.37it/s]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
2024-05-01 22:07:17 INFO UNet2DConditionModel: 64, 8, 768, False, False original_unet.py:1387
2024-05-01 22:07:24 INFO U-Net converted to original U-Net train_util.py:4372
2024-05-01 22:07:25 INFO Enable xformers for U-Net train_util.py:2660
import network module: networks.lora
INFO [Dataset 0] train_util.py:2079
INFO caching latents. train_util.py:974
INFO checking cache validity... train_util.py:984
100%|████████████████████| 99/99 [00:00<00:00, 12344.26it/s]
INFO caching latents... train_util.py:1021
100%|████████████████████| 99/99 [00:22<00:00, 4.47it/s]
2024-05-01 22:07:48 INFO create LoRA network. base dim (rank): 8, alpha: 1 lora.py:810
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora.py:811
INFO create LoRA for Text Encoder: lora.py:905
INFO create LoRA for Text Encoder: 72 modules. lora.py:910
INFO create LoRA for U-Net: 192 modules. lora.py:918
INFO enable LoRA for text encoder lora.py:961
2024-05-01 22:07:49 INFO enable LoRA for U-Net lora.py:966
INFO CrossAttnDownBlock2D False -> True original_unet.py:1521
INFO CrossAttnDownBlock2D False -> True original_unet.py:1521
INFO CrossAttnDownBlock2D False -> True original_unet.py:1521
INFO DownBlock2D False -> True original_unet.py:1521
INFO UNetMidBlock2DCrossAttn False -> True original_unet.py:1521
INFO UpBlock2D False -> True original_unet.py:1521
INFO CrossAttnUpBlock2D False -> True original_unet.py:1521
INFO CrossAttnUpBlock2D False -> True original_unet.py:1521
INFO CrossAttnUpBlock2D False -> True original_unet.py:1521
prepare optimizer, data loader etc.
INFO use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False} train_util.py:4047
WARNING because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません train_util.py:4075
WARNING constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません train_util.py:4079
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 3960
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 3960
num epochs / epoch数: 1
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1600
steps: 0%| | 0/1600 [00:00<?, ?it/s]
epoch 1/1
C:\EnriqueModels\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
steps: 48%|███████████████████████████▏ | 776/1600 [09:15<09:50, 1.40it/s, avr_loss=0.187]

bmaltais commented 2 months ago

Hmm, possibly try increasing max_train_steps to a number high enough to cover all the training steps, maybe something like 16000000.
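
For reference, the figures in the log are consistent with the step cap, rather than the epoch setting, deciding how long training runs. The following is only a rough sketch of that arithmetic. It assumes the trainer derives the epoch count as ceil(max_train_steps / batches per epoch), which matches the logged values (3960 batches per epoch, 1600 total steps, 1 epoch) but is not confirmed anywhere in this thread. The function names are hypothetical and are not part of kohya_ss or sd-scripts.

```python
import math


def steps_per_epoch(image_count, num_repeats, batch_size=1, grad_accum=1):
    """Optimizer steps in one pass over the repeated dataset."""
    batches = math.ceil(image_count * num_repeats / batch_size)
    return math.ceil(batches / grad_accum)


def epochs_run(max_train_steps, per_epoch):
    """Assumed rule: the step cap is converted into a whole number of epochs."""
    return math.ceil(max_train_steps / per_epoch)


def steps_needed(epochs, per_epoch):
    """Smallest step cap that still covers the requested number of epochs."""
    return epochs * per_epoch


# Values taken from the log above: 99 images, 40 repeats, batch size 1.
per_epoch = steps_per_epoch(image_count=99, num_repeats=40)
print("steps per epoch:", per_epoch)
print("epochs with a cap of 1600 steps:", epochs_run(1600, per_epoch))
print("steps needed for 10 epochs:", steps_needed(10, per_epoch))
```

Under that assumption, a cap of 1600 steps ends partway through the first epoch, which would explain the "epoch 1/1" line, and 10 full epochs on this dataset would need a cap of at least 39600 steps.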