bmaltais / kohya_ss


'""' is not recognized as an internal or external command #2475

Open lideborg opened 1 month ago

lideborg commented 1 month ago

I've tried to set everything up correctly, and I've followed enough tutorials to think I'm doing it right when trying to train a LoRA. But this error keeps coming up in both DreamBooth training and regular LoRA training. Any idea what the issue might be?

Edit: I know 11 images are too few, but right now I'm just trying to get it up and running.

Thanks

15:47:42-390690 INFO     Loading config...
15:48:05-666881 INFO     Start training LoRA Standard ...
15:48:05-667884 INFO     Validating lr scheduler arguments...
15:48:05-668883 INFO     Validating optimizer arguments...
15:48:05-669885 INFO     Validating model file or folder path runwayml/stable-diffusion-v1-5 existence...
15:48:05-670886 INFO     ...huggingface.co model, skipping validation
15:48:05-671886 INFO     Validating output_dir path D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model existence...
15:48:05-672887 INFO     ...valid
15:48:05-673888 INFO     Validating train_data_dir path D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img existence...
15:48:05-673888 INFO     ...valid
15:48:05-674889 INFO     reg_data_dir not specified, skipping validation
15:48:05-675890 INFO     Validating logging_dir path D:/Dropbox/Work/Feature/09_LoRA/002_vivid\log existence...
15:48:05-676891 INFO     ...valid
15:48:05-676891 INFO     log_tracker_config not specified, skipping validation
15:48:05-677891 INFO     resume not specified, skipping validation
15:48:05-678892 INFO     vae not specified, skipping validation
15:48:05-679893 INFO     network_weights not specified, skipping validation
15:48:05-680894 INFO     dataset_config not specified, skipping validation
15:48:05-681893 INFO     Folder 25_vivid object: 25 repeats found
15:48:05-682895 INFO     Folder 25_vivid object: 11 images found
15:48:05-683895 INFO     Folder 25_vivid object: 11 * 25 = 275 steps
15:48:05-684897 INFO     Regulatization factor: 1
15:48:05-684897 INFO     Total steps: 275
15:48:05-685898 INFO     Train batch size: 3
15:48:05-686899 INFO     Gradient accumulation steps: 1
15:48:05-687899 INFO     Epoch: 10
15:48:05-688900 INFO     Max train steps: 950
15:48:05-688900 INFO     stop_text_encoder_training = 0
15:48:05-689901 INFO     lr_warmup_steps = 0
15:48:05-696907 INFO     Saving training config to
                         D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model\vivid_v2_20240509-154805.json...
15:48:05-698908 INFO     Executing command: "" launch --dynamo_backend no --dynamo_mode default --mixed_precision fp16
                         --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2
                         "C:/Users/hampu/AI/Kohya/kohya_ss/sd-scripts/train_network.py" --config_file
                         "./outputs/config_lora-20240509-154805.toml" with shell=True
15:48:05-705913 INFO     Command executed.
'""' is not recognized as an internal or external command,
operable program or batch file.
15:48:05-943733 INFO     Training has ended.

bmaltais commented 1 month ago

This is odd: the command starts with an empty "" before launch, and that is what causes the error. What version of the GUI is this? Unless I can reproduce the issue it is hard to fix, and I do not observe it on my test system.

What it should look like is D:\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --main_process_port 12345 --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 D:/kohya_ss/sd-scripts/train_db.py --config_file

Is it possible accelerate is not properly installed on your system?

bmaltais commented 1 month ago

I have added code to the dev branch that will detect when accelerate is not found, report an error, and stop appropriately.
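
For readers who want to see what such a check could look like, here is a minimal sketch. It is not the code that was added to the dev branch; the helper names `find_executable` and `build_launch_command` are hypothetical, and the only library call it relies on is the standard `shutil.which`.

```python
import shutil
from typing import List, Optional


def find_executable(name: str, search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of `name`, or None if it cannot be found.

    `search_path` is an optional directory (or os.pathsep-separated list of
    directories) to search instead of the PATH environment variable.
    """
    return shutil.which(name, path=search_path)


def build_launch_command(
    accelerate_path: Optional[str], script: str, config_file: str
) -> List[str]:
    """Build an `accelerate launch` command, refusing to run with an empty path.

    Hypothetical helper: failing early here avoids handing the shell a command
    that begins with an empty string, which is what the log above shows.
    """
    if not accelerate_path:
        raise FileNotFoundError(
            "accelerate executable not found; is the venv set up correctly?"
        )
    return [accelerate_path, "launch", script, "--config_file", config_file]
```

Calling `find_executable("accelerate", r"D:\kohya_ss\venv\Scripts")` (adjust the directory to your install) returns `None` when the executable is missing, which also answers the question of whether accelerate is installed in the venv.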

lideborg commented 1 month ago

Thanks for the quick response!

Not sure about accelerate. What's the easiest way to check whether it is installed?

Regarding specs, I'm on v24.0.9 with dual 3080 Tis.

09:28:17-606548 INFO     Kohya_ss GUI version: v24.0.9
09:28:18-444182 INFO     Submodule initialized and updated.
09:28:18-447185 INFO     nVidia toolkit detected
09:28:31-844650 INFO     Torch 2.1.2+cu118
09:28:31-949246 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700
09:28:31-954250 INFO     Torch detected GPU: NVIDIA GeForce RTX 3080 Ti VRAM 12287 Arch (8, 6) Cores 80
09:28:31-956252 INFO     Torch detected GPU: NVIDIA GeForce RTX 3080 Ti VRAM 12288 Arch (8, 6) Cores 80
09:28:31-985275 INFO     Python version is 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit
                         (AMD64)]
09:28:31-987278 INFO     Verifying modules installation status from requirements_pytorch_windows.txt...
09:28:31-994285 INFO     Verifying modules installation status from requirements_windows.txt...
09:28:32-002292 INFO     Verifying modules installation status from requirements.txt...
09:29:11-833674 INFO     headless: False
09:29:12-046431 INFO     Using shell=True when running external commands...
IMPORTANT: You are using gradio version 4.26.0, however version 4.29.0 is available, please upgrade.
--------
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

bmaltais commented 1 month ago

Try upgrading to the latest release, delete the venv folder, and run setup again. Maybe this will resolve the missing accelerate.

lideborg commented 1 month ago

All right, I gave it a reinstall and I'm getting closer, but there are still some failures. Any idea what is happening here?

14:47:08-721334 INFO     Start training LoRA Standard ...
14:47:08-722335 INFO     Validating lr scheduler arguments...
14:47:08-723839 INFO     Validating optimizer arguments...
14:47:08-724842 INFO     Validating D:/Dropbox/Work/Feature/09_LoRA/002_vivid\log existence and writability... SUCCESS
14:47:08-725843 INFO     Validating D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model existence and writability... SUCCESS
14:47:08-726843 INFO     Validating runwayml/stable-diffusion-v1-5 existence... SKIPPING: huggingface.co model
14:47:08-727844 INFO     Validating D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img existence... SUCCESS
14:47:08-728845 INFO     Folder 25_vivid object: 25 repeats found
14:47:08-729846 INFO     Folder 25_vivid object: 11 images found
14:47:08-730847 INFO     Folder 25_vivid object: 11 * 25 = 275 steps
14:47:08-731848 INFO     Regulatization factor: 1
14:47:08-731848 INFO     Total steps: 275
14:47:08-732848 INFO     Train batch size: 3
14:47:08-733849 INFO     Gradient accumulation steps: 1
14:47:08-734850 INFO     Epoch: 10
14:47:08-735852 INFO     Max train steps: 950
14:47:08-737856 INFO     stop_text_encoder_training = 0
14:47:08-738854 INFO     lr_warmup_steps = 0
14:47:08-741855 INFO     Saving training config to D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model\vivid_v2_20240510-144708.json...
14:47:08-744860 INFO     Executing command: C:\Users\hampu\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default
                         --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2
                         C:/Users/hampu/kohya_ss/sd-scripts/train_network.py --config_file
                         D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model/config_lora-20240510-144708.toml
14:47:08-749862 INFO     Command executed.
[2024-05-10 14:47:12,474] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [LideTower]:29500 (system error: 10049 - The requested address is not valid in its context.).
2024-05-10 14:47:20 INFO     Loading settings from                                                                             train_util.py:3744
                             D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model/config_lora-20240510-144708.toml...
                    INFO     D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model/config_lora-20240510-144708                       train_util.py:3763
2024-05-10 14:47:20 INFO     prepare tokenizer                                                                                 train_util.py:4227
                    INFO     update token length: 75                                                                           train_util.py:4244
                    INFO     Using DreamBooth method.                                                                        train_network.py:172
                    INFO     prepare images.                                                                                   train_util.py:1572
                    INFO     found directory D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img\25_vivid object contains 11 image   train_util.py:1519
                             files
                    INFO     275 train images with repeating.                                                                  train_util.py:1613
                    INFO     0 reg images.                                                                                     train_util.py:1616
                    WARNING  no regularization images / 正則化画像が見つかりませんでした                                       train_util.py:1621
                    INFO     [Dataset 0]                                                                                       config_util.py:565
                               batch_size: 3
                               resolution: (512, 512)
                               enable_bucket: False
                               network_multiplier: 1.0

                               [Subset 0 of Dataset 0]
                                 image_dir: "D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img\25_vivid object"
                                 image_count: 11
                                 num_repeats: 25
                                 shuffle_caption: False
                                 keep_tokens: 0
                                 keep_tokens_separator:
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: vivid object
                                 caption_extension: .txt

                    INFO     [Dataset 0]                                                                                       config_util.py:571
                    INFO     loading image sizes.                                                                               train_util.py:853
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<?, ?it/s]
                    INFO     prepare dataset                                                                                    train_util.py:861
                    INFO     preparing accelerator                                                                           train_network.py:225
[W socket.cpp:663] [c10d] The client socket has failed to connect to [LideTower]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "C:\Users\hampu\kohya_ss\sd-scripts\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "C:\Users\hampu\kohya_ss\sd-scripts\train_network.py", line 226, in train
    accelerator = train_util.prepare_accelerator(args)
  File "C:\Users\hampu\kohya_ss\sd-scripts\library\train_util.py", line 4305, in prepare_accelerator
    accelerator = Accelerator(
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 758, in __init__
    PartialState(cpu, **kwargs)
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 217, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2024-05-10 14:47:24,541] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 23600) of binary: C:\Users\hampu\kohya_ss\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\hampu\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
C:/Users/hampu/kohya_ss/sd-scripts/train_network.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-10_14:47:24
  host      : LideTower
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 23600)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
14:47:25-151738 INFO     Training has ended.
Keyboard interruption in main thread... closing server.

bmaltais commented 1 month ago

No idea... but it appears to complain about socket connections... perhaps some kind of antivirus is causing network access issues?
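
The warning in the log shows the client socket failing to connect to [LideTower]:29500 with Windows error 10049. One way to narrow this down before suspecting antivirus software is to check whether the machine's hostname resolves to an address and whether a local port can be bound at all. The following is a minimal sketch using only the Python standard library; the helper names `resolve_host` and `can_bind` are illustrative and are not part of kohya_ss or PyTorch.

```python
import socket
from typing import List


def resolve_host(host: str) -> List[str]:
    """Return the sorted, de-duplicated addresses `host` resolves to ([] on failure)."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))
        return True
    except OSError:
        return False
```

For example, `resolve_host(socket.gethostname())` shows which addresses the machine name maps to, and `can_bind("127.0.0.1", 29500)` reports whether the port named in the warning is free on the loopback interface. An empty list or a `False` result points at name resolution or port conflicts rather than at the training scripts.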

avan06 commented 1 month ago

I saw in the logs that torch is complaining about the error message "Distributed package doesn't have NCCL built in." Could this be related to CUDA not being installed correctly?

Reference: https://discuss.pytorch.org/t/runtimeerror-distributed-package-doesnt-have-nccl-built-in/176744
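
The traceback above ends in `init_process_group` failing because the NCCL backend is missing. To my knowledge, PyTorch builds for Windows do not include NCCL, so this can be checked directly instead of being inferred from the CUDA or driver installation. A minimal sketch, assuming it is run with the Python interpreter from the kohya_ss venv; the function name `pick_distributed_backend` is hypothetical, while `is_available`, `is_nccl_available`, and `is_gloo_available` are existing `torch.distributed` functions.

```python
import importlib.util


def pick_distributed_backend() -> str:
    """Report which torch.distributed backend this install could use.

    Returns "nccl", "gloo", or "unavailable". If torch is not installed at all,
    the result is "unavailable" rather than an ImportError.
    """
    if importlib.util.find_spec("torch") is None:
        return "unavailable"

    import torch.distributed as dist

    # The backend query functions are only usable when the distributed
    # package itself is available, so check that first.
    if not dist.is_available():
        return "unavailable"
    if dist.is_nccl_available():
        return "nccl"
    if dist.is_gloo_available():
        return "gloo"
    return "unavailable"


if __name__ == "__main__":
    print(pick_distributed_backend())
```

If this reports `gloo` on a machine with two GPUs, the NCCL error is more likely to come from a multi-GPU launch being attempted on a platform without NCCL than from a broken CUDA install. That reading is an assumption based on the traceback, which passes through `multi_gpu_launcher`, and has not been confirmed in this thread.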

lideborg commented 1 month ago

Thanks for the follow-up here. I'll try reinstalling the drivers and see if that gets it going.