bmaltais / kohya_ss

Apache License 2.0

/train_network.py /tmpfilelora.toml returned non-zero exit status 1. #2317

Open mperez96 opened 5 months ago

mperez96 commented 5 months ago

Hi! I upgraded this morning after seeing that a new version was out. I followed the instructions in the repo (git pull, then .\setup.bat choosing option 1), but when I tried to test a random LoRA creation using an existing configuration.json (I only modified the folders), I got the following error. Is there something I need to do with this new version? I hope you or someone else can help.

I'll share the terminal:

12:35:31-996696 INFO     Kohya_ss GUI version: v24.0.2
12:35:32-604420 INFO     Submodule initialized and updated.
12:35:32-610420 INFO     nVidia toolkit detected
12:35:49-956850 INFO     Torch 2.1.2+cu118
12:35:50-004468 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700
12:35:50-022960 INFO     Torch detected GPU: NVIDIA GeForce RTX 3060 VRAM 12288 Arch (8, 6) Cores 28
12:35:50-069566 INFO     Python version is 3.10.9 (tags/v3.10.9:1dd9be6, Dec  6 2022, 20:01:21) [MSC v.1934 64 bit
                         (AMD64)]
12:35:50-074557 INFO     Verifying modules installation status from requirements_pytorch_windows.txt...
12:35:50-088556 INFO     Verifying modules installation status from requirements_windows.txt...
12:35:50-094564 INFO     Verifying modules installation status from requirements.txt...
12:36:10-556496 INFO     headless: False
12:36:10-643184 INFO     Using shell=True when running external commands...
Running on local URL:  http://127.0.0.1:7860

Thanks for being a Gradio user! If you have questions or feedback, please join our Discord server and chat with us: https://discord.gg/feTf9x3ZSB

To create a public link, set `share=True` in `launch()`.
12:36:27-099942 INFO     Loading config...
12:36:59-164020 INFO     Start training LoRA Standard ...
12:36:59-166020 INFO     Validating model file or folder path
                         D:/Users/USER/Documents/AI/stable-diffusion-webui/models/Stable-diffusion/training-base-model.safetensors existence...
12:36:59-167021 INFO     ...valid
12:36:59-168020 INFO     Validating output_dir path D:\Users\USER\Pictures\AI\training\SD1-5\cl\v3.4\model existence...
12:36:59-169021 INFO     ...valid
12:36:59-170021 INFO     Validating train_data_dir path D:\Users\USER\Pictures\AI\training\SD1-5\cl\v3.4\img
                         existence...
12:36:59-171020 INFO     ...valid
12:36:59-172020 INFO     reg_data_dir not specified, skipping validation
12:36:59-173020 INFO     Validating logging_dir path D:\Users\USER\Pictures\AI\training\SD1-5\cl\v3.4\log existence...
12:36:59-174020 INFO     ...valid
12:36:59-175022 INFO     log_tracker_config not specified, skipping validation
12:36:59-176021 INFO     resume not specified, skipping validation
12:36:59-179020 INFO     vae not specified, skipping validation
12:36:59-181020 INFO     lora_network_weights not specified, skipping validation
12:36:59-182020 INFO     dataset_config not specified, skipping validation
12:36:59-183020 INFO     Folder 100_lorahd woman: 100 repeats found
12:36:59-276022 INFO     Folder 100_lorahd woman: 159 images found
12:36:59-277022 INFO     Folder 100_lorahd woman: 159 * 100 = 15900 steps
12:36:59-280022 INFO     Regulatization factor: 1
12:36:59-281020 INFO     Total steps: 15900
12:36:59-282020 INFO     Train batch size: 2
12:36:59-283020 INFO     Gradient accumulation steps: 1
12:36:59-284019 INFO     Epoch: 1
12:36:59-284019 INFO     max_train_steps (15900 / 2 / 1 * 1 * 1) = 7950
12:36:59-286020 INFO     stop_text_encoder_training = 0
12:36:59-287020 INFO     lr_warmup_steps = 0
12:36:59-289021 INFO     Saving training config to
                         D:\Users\USER\Pictures\AI\training\SD1-5\configs\lora-base.json...
12:36:59-292022 INFO     Executing command: "D:\Users\USER\Documents\AI\kohya_ss\venv\Scripts\accelerate.EXE" launch
                         --dynamo_backend no --dynamo_mode default --mixed_precision bf16 --num_processes 1
                         --num_machines 1 --num_cpu_threads_per_process 2
                         "D:/Users/USER/Documents/AI/kohya_ss/sd-scripts/train_network.py" --config_file
                         "./outputs/tmpfilelora.toml" with shell=True
12:36:59-298021 INFO     Command executed.
2024-04-17 12:37:14 WARNING  A matching Triton is not available, some optimizations will not be enabled.  __init__.py:55
                             Error caught was: No module named 'triton'
2024-04-17 12:37:41 INFO     Loading settings from ./outputs/tmpfilelora.toml...                      train_util.py:3744
                    INFO     ./outputs/tmpfilelora                                                    train_util.py:3763
Traceback (most recent call last):
  File "D:\Users\USER\Documents\AI\kohya_ss\sd-scripts\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "D:\Users\USER\Documents\AI\kohya_ss\sd-scripts\train_network.py", line 140, in train
    train_util.verify_training_args(args)
  File "D:\Users\USER\Documents\AI\kohya_ss\sd-scripts\library\train_util.py", line 3473, in verify_training_args
    raise ValueError("adaptive_noise_scale requires noise_offset / adaptive_noise_scaleを使用するにはnoise_offsetが必要 です")
ValueError: adaptive_noise_scale requires noise_offset / adaptive_noise_scaleを使用するにはnoise_offsetが必要です
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Users\USER\Documents\AI\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:\Users\USER\Documents\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "D:\Users\USER\Documents\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "D:\Users\USER\Documents\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\\Users\\USER\\Documents\\AI\\kohya_ss\\venv\\Scripts\\python.exe', 'D:/Users/USER/Documents/AI/kohya_ss/sd-scripts/train_network.py', '--config_file', './outputs/tmpfilelora.toml']' returned non-zero exit status 1.
12:37:43-859151 INFO     Training has ended.

Thanks in advance, regards!
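The step count printed in the log, `max_train_steps (15900 / 2 / 1 * 1 * 1) = 7950`, can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function and parameter names are made up for illustration, the order of the last two factors (epochs, regularization factor) is inferred from the surrounding log lines, and the rounding with `math.ceil` is an assumption rather than something the log confirms.

```python
import math


def estimate_max_train_steps(images, repeats, batch_size,
                             grad_accum_steps, epochs, reg_factor):
    """Reproduce the step arithmetic shown in the log (illustrative names)."""
    # "159 * 100 = 15900 steps" in the log
    total_steps = images * repeats
    # "(15900 / 2 / 1 * 1 * 1)" in the log; ceil is an assumed rounding rule
    return math.ceil(
        total_steps / batch_size / grad_accum_steps * epochs * reg_factor
    )


steps = estimate_max_train_steps(
    images=159, repeats=100, batch_size=2,
    grad_accum_steps=1, epochs=1, reg_factor=1,
)
print(steps)  # 7950
```

With 159 images and 100 repeats, halving the batch size to 1 would double the result to 15900 steps.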

bmaltais commented 5 months ago

What was the file you tested? One found in the test folder?

It complains about:

raise ValueError("adaptive_noise_scale requires noise_offset / adaptive_noise_scaleを使用するにはnoise_offsetが必要 です")
ValueError: adaptive_noise_scale requires noise_offset / adaptive_noise_scaleを使用するにはnoise_offsetが必要です

If you add a noise offset value it should move forward... Loading an old config in an updated GUI will require setting values for any new parameters that have been added since the config was last saved and used. I will check whether the GUI is setting default values that match what sd-scripts would set when they are not provided. Perhaps the GUI is not setting the adaptive noise scale appropriately.
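The error message implies a dependency between two settings: `adaptive_noise_scale` may only be used when `noise_offset` is also present. The sketch below illustrates that kind of check. `verify_noise_args` is a hypothetical stand-in, not the actual `verify_training_args` code from `train_util.py`, and the numeric values are placeholders chosen only for the demonstration.

```python
from types import SimpleNamespace


def verify_noise_args(args):
    """Hypothetical stand-in for the check that raised in the traceback."""
    # Assumption: the option counts as "unset" when it is None.
    if args.adaptive_noise_scale is not None and args.noise_offset is None:
        raise ValueError("adaptive_noise_scale requires noise_offset")


# Both values present: the check passes silently.
ok = SimpleNamespace(adaptive_noise_scale=0.01, noise_offset=0.05)
verify_noise_args(ok)

# adaptive_noise_scale present but noise_offset missing: the check raises.
bad = SimpleNamespace(adaptive_noise_scale=0.01, noise_offset=None)
try:
    verify_noise_args(bad)
except ValueError as err:
    message = str(err)
    print(message)
```

Under this reading, either supplying a noise offset or leaving the adaptive noise scale unset would let training proceed.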

gradi01 commented 5 months ago

Hello bmaltais, for fine-tuning, when you select SD1.5 as the model, it does not let you set the resolution to 512,512; it sets it to 1024,1024 with no way of changing it.

It also refuses to start the training :/ It tries to load a VAE, then sets an unknown VAE folder, continues, and prints:

Traceback (most recent call last):
  File "C:\kohya\kohya_ss\sd-scripts\fine_tune.py", line 526, in <module>
    train(args)
  File "C:\kohya\kohya_ss\sd-scripts\fine_tune.py", line 85, in train
    train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
  File "C:\kohya\kohya_ss\sd-scripts\library\config_util.py", line 482, in generate_dataset_group_by_blueprint
    dataset = dataset_klass(subsets=subsets, **asdict(dataset_blueprint.params))
  File "C:\kohya\kohya_ss\sd-scripts\library\train_util.py", line 1683, in __init__
    raise ValueError(f"no metadata / メタデータファイルがありません: {subset.metadata_file}")
ValueError: no metadata / メタデータファイルがありません: /meta_lat.json
Traceback (most recent call last):
  File "C:\Users\ofuduldlutd\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\ofuduldlutd\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\kohya\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\kohya\kohya_ss\venv\Scripts\python.exe', 'C:/kohya/kohya_ss/sd-scripts/fine_tune.py', '--config_file ', './outputs/tmpfilefinetune.toml']' returned non-zero exit status 1.
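The missing file is reported as `/meta_lat.json`, with no folder in front of it. One possible reading, which the thread does not confirm, is that the folder setting was empty when the path was built. The sketch below shows a pre-flight check that would surface this before launching training. `find_metadata_problem` is a hypothetical helper, not part of kohya_ss or sd-scripts.

```python
import os


def find_metadata_problem(train_dir, filename="meta_lat.json"):
    """Return a short description of the problem, or None if the file exists.

    Hypothetical helper for illustration only.
    """
    if not train_dir:
        # An empty folder joined with "/" would yield "/meta_lat.json".
        return f"training folder is empty, so the path would be '/{filename}'"
    path = os.path.join(train_dir, filename)
    if not os.path.isfile(path):
        return f"metadata file not found: {path}"
    return None


print(find_metadata_problem(""))
```

If the check reports an empty folder, the fix would be to fill in the training folder in the GUI and regenerate the metadata, rather than to change anything in the training script.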

bmaltais commented 5 months ago

Thank you for raising both of those things, I will look at them. For fine-tuning, can’t you set the resolution under Dataset Preparation? Fine-tuning does not support the same options as DreamBooth and LoRA.

mperez96 commented 5 months ago

It helped! The configuration file had the following lines:

  "noise_offset": 0,
  "noise_offset_type": "Original",

After giving noise_offset a value, it worked. I'm still wondering why I can't use 0 as a value for noise_offset, but thank you for helping me!!
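The thread does not say why `0` is rejected. One plausible explanation, offered here as an assumption and not a confirmed diagnosis, is that a step between the saved JSON and the generated TOML drops "unset" values, and that `0` is treated as unset. In Python, `0 == False` is true, so a membership test against a list containing `False` also matches `0`. The filter `drop_unset` and the `adaptive_noise_scale` value below are hypothetical.

```python
def drop_unset(config):
    """Hypothetical filter: keep only values that are not "", False, or None."""
    # Pitfall: `0 in ["", False, None]` is True because 0 == False.
    return {k: v for k, v in config.items() if v not in ["", False, None]}


saved = {
    "noise_offset": 0,                # value from the saved config above
    "noise_offset_type": "Original",  # value from the saved config above
    "adaptive_noise_scale": 0.01,     # placeholder, not shown in the thread
}

written = drop_unset(saved)
print(written)
print("noise_offset" in written)  # False
```

If something like this happens, the training script never sees `noise_offset` at all, which would match the error about `adaptive_noise_scale` requiring `noise_offset`.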

bmaltais commented 5 months ago

@gradi01 I can't reproduce the issue. I set the resolution to 512,512 under Dataset Preparation and it was properly used:

image

image

Can you provide a copy of your config so I can try to replicate from it?

gradi01 commented 5 months ago

@bmaltais do you think this will help the fine-tuning work? I would also like to know which tab is best for training a base model (checkpoint) in SD1.5, please. I'm struggling to find a way to get a correct result.