bmaltais / kohya_ss

subprocess.CalledProcessError: Command #2556

Open FurkanGozukara opened 3 months ago

FurkanGozukara commented 3 months ago

Training runs fine for many epochs but then randomly fails, as shown below. Any ideas?

Windows 11, Python 3.10.11 fresh install

I think this is related to the newest process-calling system. The failed training run below saved 2 checkpoints and reached epoch 39 before failing at random. These random failures happen frequently.

13:25:11-098720 INFO     Kohya_ss GUI version: v24.1.4
13:25:11-402144 INFO     Submodule initialized and updated.
13:25:11-404148 INFO     nVidia toolkit detected
13:25:12-421883 INFO     Torch 2.1.2+cu118
13:25:12-429883 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700
13:25:12-430883 INFO     Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24564 Arch (8, 9) Cores 128
13:25:12-434331 INFO     Python version is 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit
                         (AMD64)]
13:25:12-435330 INFO     Verifying modules installation status from requirements_pytorch_windows.txt...
13:25:12-436330 INFO     Verifying modules installation status from requirements_windows.txt...
13:25:12-437331 INFO     Verifying modules installation status from requirements.txt...
13:25:17-079487 INFO     headless: False
13:25:17-108541 INFO     Using shell=True when running external commands...
Running on local URL:  http://127.0.0.1:7860
13:32:45-543750 INFO     Start training LoRA Standard ...
13:32:45-545750 INFO     Validating lr scheduler arguments...
13:32:45-546750 INFO     Validating optimizer arguments...
13:32:45-548750 INFO     Validating C:/test_kohya existence and writability... SUCCESS
13:32:45-549750 INFO     Validating C:/ComfyUI_windows_portable/ComfyUI/models/checkpoints/sd_xl_base_1.0.safetensors
                         existence... SUCCESS
13:32:45-550749 INFO     Validating C:/Users/RENDA/Pictures/31maymodel\img existence... SUCCESS
13:32:45-551893 INFO     Folder 1_ohwx style: 1 repeats found
13:32:45-552893 INFO     Folder 1_ohwx style: 51 images found
13:32:45-553893 INFO     Folder 1_ohwx style: 51 * 1 = 51 steps
13:32:45-553893 INFO     Regulatization factor: 1
13:32:45-554893 INFO     Total steps: 51
13:32:45-555893 INFO     Train batch size: 1
13:32:45-556894 INFO     Gradient accumulation steps: 1
13:32:45-557893 INFO     Epoch: 150
13:32:45-558893 INFO     max_train_steps (51 / 1 / 1 * 150 * 1) = 7650
13:32:45-559893 INFO     stop_text_encoder_training = 0
13:32:45-560894 INFO     lr_warmup_steps = 0
13:32:45-562919 INFO     Saving training config to C:/test_kohya\last_20240531-133245.json...
13:32:45-564920 INFO     Executing command: C:\kohya_new\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no
                         --dynamo_mode default --mixed_precision bf16 --num_processes 1 --num_machines 1
                         --num_cpu_threads_per_process 2 C:/kohya_new/kohya_ss/sd-scripts/sdxl_train_network.py
                         --config_file C:/test_kohya/config_lora-20240531-133245.toml
13:32:45-568920 INFO     Command executed.
2024-05-31 13:32:51 INFO     Loading settings from C:/test_kohya/config_lora-20240531-133245.toml...  train_util.py:3744
                    INFO     C:/test_kohya/config_lora-20240531-133245                                train_util.py:3763
2024-05-31 13:32:51 INFO     prepare tokenizers                                                   sdxl_train_util.py:134
2024-05-31 13:32:52 INFO     update token length: 75                                              sdxl_train_util.py:159
                    INFO     Using DreamBooth method.                                               train_network.py:172
                    INFO     prepare images.                                                          train_util.py:1572
                    INFO     found directory C:\Users\RENDA\Pictures\31maymodel\img\1_ohwx style      train_util.py:1519
                             contains 51 image files
                    WARNING  No caption file found for 51 images. Training will continue without      train_util.py:1550
                             captions for these images. If class token exists, it will be used. /
                             51枚の画像にキャプションファイルが見つかりませんでした。これらの画像につ
                             いてはキャプションなしで学習を続行します。class
                             tokenが存在する場合はそれを使います。
                    WARNING  C:\Users\RENDA\Pictures\31maymodel\img\1_ohwx style\1.png                train_util.py:1557
                    WARNING  C:\Users\RENDA\Pictures\31maymodel\img\1_ohwx style\10.png               train_util.py:1557
                    WARNING  C:\Users\RENDA\Pictures\31maymodel\img\1_ohwx style\11.png               train_util.py:1557
                    WARNING  C:\Users\RENDA\Pictures\31maymodel\img\1_ohwx style\12.png               train_util.py:1557
                    WARNING  C:\Users\RENDA\Pictures\31maymodel\img\1_ohwx style\13.png               train_util.py:1557
                    WARNING  C:\Users\RENDA\Pictures\31maymodel\img\1_ohwx style\14.png... and 46     train_util.py:1555
                             more
                    INFO     51 train images with repeating.                                          train_util.py:1613
                    INFO     0 reg images.                                                            train_util.py:1616
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:1621
                    INFO     [Dataset 0]                                                              config_util.py:565
                               batch_size: 1
                               resolution: (1024, 1024)
                               enable_bucket: False
                               network_multiplier: 1.0

                               [Subset 0 of Dataset 0]
                                 image_dir: "C:\Users\RENDA\Pictures\31maymodel\img\1_ohwx style"
                                 image_count: 51
                                 num_repeats: 1
                                 shuffle_caption: False
                                 keep_tokens: 0
                                 keep_tokens_separator:
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: ohwx style
                                 caption_extension: .txt

                    INFO     [Dataset 0]                                                              config_util.py:571
                    INFO     loading image sizes.                                                      train_util.py:853
100%|███████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 48417.72it/s]
                    INFO     prepare dataset                                                           train_util.py:861
                    WARNING  clip_skip will be unexpected / SDXL学習ではclip_skipは動作しません   sdxl_train_util.py:343
                    INFO     preparing accelerator                                                  train_network.py:225
accelerator device: cuda
                    INFO     loading model for process 0/1                                         sdxl_train_util.py:30
                    INFO     load StableDiffusion checkpoint:                                      sdxl_train_util.py:70
                             C:/ComfyUI_windows_portable/ComfyUI/models/checkpoints/sd_xl_base_1.0
                             .safetensors
                    INFO     building U-Net                                                       sdxl_model_util.py:192
                    INFO     loading U-Net from checkpoint                                        sdxl_model_util.py:196
2024-05-31 13:33:05 INFO     U-Net: <All keys matched successfully>                               sdxl_model_util.py:202
                    INFO     building text encoders                                               sdxl_model_util.py:205
                    INFO     loading text encoders from checkpoint                                sdxl_model_util.py:258
2024-05-31 13:33:06 INFO     text encoder 1: <All keys matched successfully>                      sdxl_model_util.py:272
2024-05-31 13:33:11 INFO     text encoder 2: <All keys matched successfully>                      sdxl_model_util.py:276
                    INFO     building VAE                                                         sdxl_model_util.py:279
                    INFO     loading VAE from checkpoint                                          sdxl_model_util.py:284
                    INFO     VAE: <All keys matched successfully>                                 sdxl_model_util.py:287
                    INFO     Enable xformers for U-Net                                                train_util.py:2660
import network module: networks.lora
2024-05-31 13:33:12 INFO     [Dataset 0]                                                              train_util.py:2079
                    INFO     caching latents.                                                          train_util.py:974
                    INFO     checking cache validity...                                                train_util.py:984
100%|████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 2751.92it/s]
                    INFO     caching latents...                                                       train_util.py:1021
0it [00:00, ?it/s]
2024-05-31 13:33:13 INFO     create LoRA network. base dim (rank): 128, alpha: 1                             lora.py:810
                    INFO     neuron dropout: p=None, rank dropout: p=None, module dropout: p=None            lora.py:811
                    INFO     create LoRA for Text Encoder 1:                                                 lora.py:902
2024-05-31 13:33:14 INFO     create LoRA for Text Encoder 2:                                                 lora.py:902
                    INFO     create LoRA for Text Encoder: 264 modules.                                      lora.py:910
2024-05-31 13:33:16 INFO     create LoRA for U-Net: 722 modules.                                             lora.py:918
                    INFO     enable LoRA for text encoder                                                    lora.py:961
                    INFO     enable LoRA for U-Net                                                           lora.py:966
prepare optimizer, data loader etc.
                    INFO     use Adafactor optimizer | {'scale_parameter': False, 'relative_step':    train_util.py:4047
                             False, 'warmup_init': False}
                    WARNING  because max_grad_norm is set, clip_grad_norm is enabled. consider set to train_util.py:4075
                             0 /
                             max_grad_normが設定されているためclip_grad_normが有効になります。0に設定
                             して無効にしたほうがいいかもしれません
                    WARNING  constant_with_warmup will be good /                                      train_util.py:4079
                             スケジューラはconstant_with_warmupが良いかもしれません
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 51
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 51
  num epochs / epoch数: 150
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 7650
steps:   0%|                                                                                  | 0/7650 [00:00<?, ?it/s]
epoch 1/150
steps:   1%|▎                                                     | 48/7650 [00:45<2:00:39,  1.05it/s, avr_loss=0.0818]
13:34:13-878020 ERROR    Training is already running. Can't start another training session.
steps:   1%|▎                                                     | 51/7650 [00:48<2:00:30,  1.05it/s, avr_loss=0.0818]
epoch 2/150
steps:   1%|▋                                                    | 102/7650 [01:45<2:09:49,  1.03s/it, avr_loss=0.0873]
epoch 3/150
steps:   2%|█                                                    | 153/7650 [02:32<2:04:11,  1.01it/s, avr_loss=0.0942]
epoch 4/150
steps:   3%|█▍                                                    | 204/7650 [03:23<2:03:48,  1.00it/s, avr_loss=0.113]
epoch 5/150
steps:   3%|█▊                                                   | 255/7650 [04:07<1:59:36,  1.03it/s, avr_loss=0.0963]
epoch 6/150
steps:   4%|██                                                   | 306/7650 [04:52<1:57:05,  1.05it/s, avr_loss=0.0968]
epoch 7/150
steps:   5%|██▍                                                  | 357/7650 [05:35<1:54:13,  1.06it/s, avr_loss=0.0805]
epoch 8/150
steps:   5%|██▉                                                   | 408/7650 [06:19<1:52:18,  1.07it/s, avr_loss=0.115]
epoch 9/150
steps:   6%|███▏                                                  | 459/7650 [07:05<1:51:00,  1.08it/s, avr_loss=0.106]
epoch 10/150
steps:   7%|███▌                                                  | 510/7650 [07:49<1:49:36,  1.09it/s, avr_loss=0.106]
epoch 11/150
steps:   7%|███▉                                                 | 561/7650 [08:35<1:48:37,  1.09it/s, avr_loss=0.0889]
epoch 12/150
steps:   8%|████▏                                                | 598/7650 [09:09<1:48:05,  1.09it/s, avr_loss=0.0871]
13:42:38-590692 INFO     Save...
steps:   8%|████▏                                                | 612/7650 [09:29<1:49:10,  1.07it/s, avr_loss=0.0907]
epoch 13/150
steps:   9%|████▊                                                   | 663/7650 [10:41<1:52:37,  1.03it/s, avr_loss=0.1]
epoch 14/150
steps:   9%|████▉                                                | 714/7650 [12:22<2:00:17,  1.04s/it, avr_loss=0.0662]
epoch 15/150
steps:  10%|█████▎                                               | 765/7650 [14:07<2:07:05,  1.11s/it, avr_loss=0.0978]
saving checkpoint: C:/test_kohya\last-000015.safetensors

epoch 16/150
steps:  11%|█████▋                                               | 816/7650 [15:56<2:13:30,  1.17s/it, avr_loss=0.0724]
epoch 17/150
steps:  11%|██████                                               | 867/7650 [16:47<2:11:25,  1.16s/it, avr_loss=0.0902]
epoch 18/150
steps:  12%|██████▎                                              | 918/7650 [17:32<2:08:38,  1.15s/it, avr_loss=0.0977]
epoch 19/150
steps:  13%|██████▋                                              | 969/7650 [18:17<2:06:09,  1.13s/it, avr_loss=0.0878]
epoch 20/150
steps:  13%|██████▉                                             | 1020/7650 [19:03<2:03:52,  1.12s/it, avr_loss=0.0911]
epoch 21/150
steps:  14%|███████▍                                             | 1071/7650 [19:46<2:01:28,  1.11s/it, avr_loss=0.109]
epoch 22/150
steps:  15%|███████▊                                             | 1122/7650 [20:33<1:59:34,  1.10s/it, avr_loss=0.089]
epoch 23/150
steps:  15%|████████▏                                            | 1173/7650 [21:15<1:57:24,  1.09s/it, avr_loss=0.142]
epoch 24/150
steps:  16%|████████▊                                              | 1224/7650 [21:59<1:55:24,  1.08s/it, avr_loss=0.1]
epoch 25/150
steps:  17%|████████▊                                            | 1275/7650 [22:43<1:53:39,  1.07s/it, avr_loss=0.076]
epoch 26/150
steps:  17%|█████████▏                                           | 1326/7650 [23:28<1:51:57,  1.06s/it, avr_loss=0.117]
epoch 27/150
steps:  18%|█████████▌                                           | 1377/7650 [24:14<1:50:25,  1.06s/it, avr_loss=0.118]
epoch 28/150
steps:  19%|█████████▉                                           | 1428/7650 [24:59<1:48:54,  1.05s/it, avr_loss=0.131]
epoch 29/150
steps:  19%|██████████                                          | 1479/7650 [25:44<1:47:24,  1.04s/it, avr_loss=0.0985]
epoch 30/150
steps:  20%|██████████▌                                          | 1530/7650 [26:29<1:45:58,  1.04s/it, avr_loss=0.146]
saving checkpoint: C:/test_kohya\last-000030.safetensors

epoch 31/150
steps:  21%|██████████▉                                          | 1581/7650 [27:16<1:44:43,  1.04s/it, avr_loss=0.102]
epoch 32/150
steps:  21%|███████████                                         | 1632/7650 [28:00<1:43:18,  1.03s/it, avr_loss=0.0959]
epoch 33/150
steps:  22%|███████████▍                                        | 1683/7650 [28:46<1:41:59,  1.03s/it, avr_loss=0.0793]
epoch 34/150
steps:  23%|███████████▊                                        | 1734/7650 [29:29<1:40:36,  1.02s/it, avr_loss=0.0993]
epoch 35/150
steps:  23%|████████████▏                                       | 1785/7650 [30:12<1:39:16,  1.02s/it, avr_loss=0.0905]
epoch 36/150
steps:  24%|████████████▍                                       | 1836/7650 [30:57<1:38:01,  1.01s/it, avr_loss=0.0864]
epoch 37/150
steps:  25%|████████████▊                                       | 1887/7650 [31:41<1:36:46,  1.01s/it, avr_loss=0.0934]
epoch 38/150
steps:  25%|█████████████▍                                       | 1938/7650 [32:27<1:35:39,  1.00s/it, avr_loss=0.103]
epoch 39/150
steps:  26%|█████████████▍                                      | 1985/7650 [33:07<1:34:33,  1.00s/it, avr_loss=0.0928]
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\kohya_new\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\kohya_new\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\kohya_new\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\kohya_new\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\kohya_new\\kohya_ss\\venv\\Scripts\\python.exe', 'C:/kohya_new/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', 'C:/test_kohya/config_lora-20240531-133245.toml']' returned non-zero exit status 3221225477.
14:06:40-973914 INFO     Training has ended.
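
For reference, the exit status 3221225477 is easier to interpret in hexadecimal. It equals 0xC0000005, the Windows NTSTATUS value for an access violation, which suggests the child python.exe crashed in native code rather than raising a Python exception. Below is a minimal sketch of that conversion. `describe_exit_status` and its small lookup table are hypothetical helpers written for illustration; they are not part of kohya_ss or accelerate.

```python
# Hypothetical helper: map a Windows process exit status to its NTSTATUS name.
# The table lists only a few well-known codes and is not exhaustive.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC000013A: "STATUS_CONTROL_C_EXIT",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_status(status: int) -> str:
    """Return the exit status as 32-bit hex plus a name when it is known."""
    # Some tools report the same value as a negative signed integer,
    # so normalise to an unsigned 32-bit value first.
    code = status & 0xFFFFFFFF
    name = KNOWN_NTSTATUS.get(code, "unknown")
    return f"0x{code:08X} ({name})"


print(describe_exit_status(3221225477))  # prints: 0xC0000005 (STATUS_ACCESS_VIOLATION)
```

The same value sometimes shows up as -1073741819 when a tool treats the status as a signed 32-bit integer; the helper maps both forms to the same result.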
FurkanGozukara commented 3 months ago

@bmaltais

Here is another error. It also happens randomly; after several attempts, training somehow starts working.

16:09:07-025674 INFO     Loading config...
16:09:15-470598 INFO     Save...
16:10:33-394977 INFO     Save...
16:10:36-458435 INFO     Start training Dreambooth...
16:10:36-462434 INFO     Validating lr scheduler arguments...
16:10:36-463433 INFO     Validating optimizer arguments...
16:10:36-463433 INFO     Validating C:/test_kohya/DreamBooth existence and writability... SUCCESS
16:10:36-464433 INFO     Validating C:/ComfyUI_windows_portable/ComfyUI/models/checkpoints/sd_xl_base_1.0.safetensors
                         existence... SUCCESS
16:10:36-466433 INFO     Validating C:/Users/RENDA/Pictures/31maymodel\img existence... SUCCESS
16:10:36-466433 INFO     Folder 1_ohwx style: 1 repeats found
16:10:36-467433 INFO     Folder 1_ohwx style: 51 images found
16:10:36-468432 INFO     Folder 1_ohwx style: 51 * 1 = 51 steps
16:10:36-468432 INFO     Regulatization factor: 1
16:10:36-469433 INFO     Total steps: 51
16:10:36-469433 INFO     Train batch size: 1
16:10:36-470433 INFO     Gradient accumulation steps: 1
16:10:36-470433 INFO     Epoch: 150
16:10:36-471432 INFO     max_train_steps (51 / 1 / 1 * 150 * 1) = 7650
16:10:36-471432 INFO     lr_warmup_steps = 0
16:10:36-473433 INFO     Saving training config to C:/test_kohya/DreamBooth\last_20240531-161036.json...
16:10:36-475434 INFO     Executing command: C:\kohya_new\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no
                         --dynamo_mode default --mixed_precision bf16 --num_processes 1 --num_machines 1
                         --num_cpu_threads_per_process 2 C:/kohya_new/kohya_ss/sd-scripts/sdxl_train.py --config_file
                         C:/test_kohya/DreamBooth/config_dreambooth-20240531-161036.toml
16:10:36-478432 INFO     Command executed.
2024-05-31 16:10:42 INFO     Loading settings from                                                    train_util.py:3744
                             C:/test_kohya/DreamBooth/config_dreambooth-20240531-161036.toml...
                    INFO     C:/test_kohya/DreamBooth/config_dreambooth-20240531-161036               train_util.py:3763
                    WARNING  clip_skip will be unexpected / SDXL学習ではclip_skipは動作しません   sdxl_train_util.py:343
2024-05-31 16:10:42 INFO     prepare tokenizers                                                   sdxl_train_util.py:134
                    INFO     update token length: 75                                              sdxl_train_util.py:159
                    INFO     Using DreamBooth method.                                                  sdxl_train.py:144
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\kohya_new\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\kohya_new\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\kohya_new\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\kohya_new\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\kohya_new\\kohya_ss\\venv\\Scripts\\python.exe', 'C:/kohya_new/kohya_ss/sd-scripts/sdxl_train.py', '--config_file', 'C:/test_kohya/DreamBooth/config_dreambooth-20240531-161036.toml']' returned non-zero exit status 3221225477.
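
The traceback above comes from the launcher and only reports that the child process exited abnormally. One possible way to get more detail, assuming the training command can be started with an extra environment variable, is Python's built-in `faulthandler`, which is enabled at interpreter startup when `PYTHONFAULTHANDLER` is set to a non-empty value and dumps the Python stack on a fatal native error. The sketch below uses a hypothetical wrapper, `run_with_faulthandler`, and a trivial child command in place of the real training command; child processes normally inherit the variable, but whether it reaches the training script through the GUI and accelerate is an assumption to verify.

```python
import os
import subprocess
import sys


def run_with_faulthandler(cmd):
    """Run cmd with PYTHONFAULTHANDLER=1 added to the inherited environment."""
    # Hypothetical wrapper for illustration; not part of kohya_ss or accelerate.
    env = dict(os.environ, PYTHONFAULTHANDLER="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in child process that only reports whether faulthandler is active.
result = run_with_faulthandler(
    [sys.executable, "-c", "import faulthandler; print(faulthandler.is_enabled())"]
)
print(result.stdout.strip(), result.returncode)
```

With the handler active, a crash like the one above should also print the Python frames that were running at the time of the fault to stderr, which may help narrow down the failing call.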
kappaman00 commented 2 months ago

I am getting the exact same error. Did you find a solution?