'""' is not recognized as an internal or external command #2475

Open lideborg opened 1 month ago

lideborg commented 1 month ago

I've tried to set everything up correctly, and I think I've followed enough tutorials to be training a LoRA the right way. But this error keeps coming up during both DreamBooth training and regular LoRA training. Any idea what the issue might be?

Edit: I know 11 images is too few, but right now I'm just trying to get it up and running.


15:47:42-390690 INFO     Loading config...
15:48:05-666881 INFO     Start training LoRA Standard ...
15:48:05-667884 INFO     Validating lr scheduler arguments...
15:48:05-668883 INFO     Validating optimizer arguments...
15:48:05-669885 INFO     Validating model file or folder path runwayml/stable-diffusion-v1-5 existence...
15:48:05-670886 INFO model, skipping validation
15:48:05-671886 INFO     Validating output_dir path D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model existence...
15:48:05-672887 INFO     ...valid
15:48:05-673888 INFO     Validating train_data_dir path D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img existence...
15:48:05-673888 INFO     ...valid
15:48:05-674889 INFO     reg_data_dir not specified, skipping validation
15:48:05-675890 INFO     Validating logging_dir path D:/Dropbox/Work/Feature/09_LoRA/002_vivid\log existence...
15:48:05-676891 INFO     ...valid
15:48:05-676891 INFO     log_tracker_config not specified, skipping validation
15:48:05-677891 INFO     resume not specified, skipping validation
15:48:05-678892 INFO     vae not specified, skipping validation
15:48:05-679893 INFO     network_weights not specified, skipping validation
15:48:05-680894 INFO     dataset_config not specified, skipping validation
15:48:05-681893 INFO     Folder 25_vivid object: 25 repeats found
15:48:05-682895 INFO     Folder 25_vivid object: 11 images found
15:48:05-683895 INFO     Folder 25_vivid object: 11 * 25 = 275 steps
15:48:05-684897 INFO     Regulatization factor: 1
15:48:05-684897 INFO     Total steps: 275
15:48:05-685898 INFO     Train batch size: 3
15:48:05-686899 INFO     Gradient accumulation steps: 1
15:48:05-687899 INFO     Epoch: 10
15:48:05-688900 INFO     Max train steps: 950
15:48:05-688900 INFO     stop_text_encoder_training = 0
15:48:05-689901 INFO     lr_warmup_steps = 0
15:48:05-696907 INFO     Saving training config to
15:48:05-698908 INFO     Executing command: "" launch --dynamo_backend no --dynamo_mode default --mixed_precision fp16
                         --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2
                         "C:/Users/hampu/AI/Kohya/kohya_ss/sd-scripts/" --config_file
                         "./outputs/config_lora-20240509-154805.toml" with shell=True
15:48:05-705913 INFO     Command executed.
'""' is not recognized as an internal or external command,
operable program or batch file.
15:48:05-943733 INFO     Training has ended.


bmaltais commented 1 month ago

This is odd... it is adding a "" before launch, and that is what causes the error. What version of the GUI is this? Unless I can reproduce the issue it is hard to fix, and I do not observe it on my test system.

What it should look like is D:\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --main_process_port 12345 --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 D:/kohya_ss/sd-scripts/ --config_file

Is it possible accelerate is not properly installed on your system?
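One way to check that is to ask the Python interpreter inside the GUI's virtual environment whether it can see the package. The snippet below is a minimal sketch, not part of kohya_ss; `describe_package` is a hypothetical helper name. It assumes you run it with the venv's own `python.exe` (for example after activating the venv), because running it with a system-wide Python would inspect the wrong environment.

```python
import importlib.metadata
import importlib.util


def describe_package(name: str) -> str:
    """Report whether `name` is importable by the running interpreter."""
    # find_spec returns None when a top-level package cannot be located.
    if importlib.util.find_spec(name) is None:
        return f"{name}: NOT installed for this interpreter"
    try:
        version = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        # Importable, but no distribution metadata (e.g. a stdlib module).
        version = "unknown version"
    return f"{name}: installed ({version})"


if __name__ == "__main__":
    print(describe_package("accelerate"))
```

If this reports that accelerate is not installed, the empty `""` in the launch command is most likely the GUI failing to resolve the accelerate executable.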

bmaltais commented 1 month ago

I have added code to the dev branch that will detect when accelerate is not found, report an error, and stop appropriately.
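For readers wondering what such a check can look like, here is a rough sketch of the idea. It is not the actual code from the dev branch; `build_launch_command` and its parameters are hypothetical names used only for illustration. The point is to resolve the executable first and fail loudly instead of building a command that starts with an empty string.

```python
import shutil


def build_launch_command(script_path: str, executable: str = "accelerate") -> list[str]:
    """Build a launch command, refusing to continue if the executable is missing."""
    # shutil.which returns None when the executable is not on PATH.
    resolved = shutil.which(executable)
    if resolved is None:
        raise FileNotFoundError(
            f"Could not find '{executable}'. Is it installed in the active venv?"
        )
    # Only reached when a real path was found, so the command never begins with "".
    return [resolved, "launch", script_path]
```

With a guard like this, a missing accelerate shows up as a clear error message rather than as `'""' is not recognized as an internal or external command`.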

lideborg commented 1 month ago

Thanks for the quick response!

I'm not sure about accelerate. What's the easiest way to check?

Regarding specs, I'm on v24.0.9 with dual 3080 Tis.

09:28:17-606548 INFO     Kohya_ss GUI version: v24.0.9
09:28:18-444182 INFO     Submodule initialized and updated.
09:28:18-447185 INFO     nVidia toolkit detected
09:28:31-844650 INFO     Torch 2.1.2+cu118
09:28:31-949246 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700
09:28:31-954250 INFO     Torch detected GPU: NVIDIA GeForce RTX 3080 Ti VRAM 12287 Arch (8, 6) Cores 80
09:28:31-956252 INFO     Torch detected GPU: NVIDIA GeForce RTX 3080 Ti VRAM 12288 Arch (8, 6) Cores 80
09:28:31-985275 INFO     Python version is 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit
09:28:31-987278 INFO     Verifying modules installation status from requirements_pytorch_windows.txt...
09:28:31-994285 INFO     Verifying modules installation status from requirements_windows.txt...
09:28:32-002292 INFO     Verifying modules installation status from requirements.txt...
09:29:11-833674 INFO     headless: False
09:29:12-046431 INFO     Using shell=True when running external commands...
IMPORTANT: You are using gradio version 4.26.0, however version 4.29.0 is available, please upgrade.
Running on local URL:

To create a public link, set `share=True` in `launch()`.

bmaltais commented 1 month ago

Try upgrading to the latest release, delete the venv, and run setup again. Maybe this will resolve the missing accelerate.

lideborg commented 1 month ago

All right, I gave it a reinstall and I'm getting closer, but there are still some failures. Any idea what is happening here?

14:47:08-721334 INFO     Start training LoRA Standard ...
14:47:08-722335 INFO     Validating lr scheduler arguments...
14:47:08-723839 INFO     Validating optimizer arguments...
14:47:08-724842 INFO     Validating D:/Dropbox/Work/Feature/09_LoRA/002_vivid\log existence and writability... SUCCESS
14:47:08-725843 INFO     Validating D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model existence and writability... SUCCESS
14:47:08-726843 INFO     Validating runwayml/stable-diffusion-v1-5 existence... SKIPPING: model
14:47:08-727844 INFO     Validating D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img existence... SUCCESS
14:47:08-728845 INFO     Folder 25_vivid object: 25 repeats found
14:47:08-729846 INFO     Folder 25_vivid object: 11 images found
14:47:08-730847 INFO     Folder 25_vivid object: 11 * 25 = 275 steps
14:47:08-731848 INFO     Regulatization factor: 1
14:47:08-731848 INFO     Total steps: 275
14:47:08-732848 INFO     Train batch size: 3
14:47:08-733849 INFO     Gradient accumulation steps: 1
14:47:08-734850 INFO     Epoch: 10
14:47:08-735852 INFO     Max train steps: 950
14:47:08-737856 INFO     stop_text_encoder_training = 0
14:47:08-738854 INFO     lr_warmup_steps = 0
14:47:08-741855 INFO     Saving training config to D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model\vivid_v2_20240510-144708.json...
14:47:08-744860 INFO     Executing command: C:\Users\hampu\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default
                         --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2
                         C:/Users/hampu/kohya_ss/sd-scripts/ --config_file
14:47:08-749862 INFO     Command executed.
[2024-05-10 14:47:12,474] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [LideTower]:29500 (system error: 10049 - The requested address is not valid in its context.).
2024-05-10 14:47:20 INFO     Loading settings from                                                                   
                    INFO     D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model/config_lora-20240510-144708             
2024-05-10 14:47:20 INFO     prepare tokenizer                                                                       
                    INFO     update token length: 75                                                                 
                    INFO     Using DreamBooth method.                                                              
                    INFO     prepare images.                                                                         
                    INFO     found directory D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img\25_vivid object contains 11 image
                    INFO     275 train images with repeating.                                                        
                    INFO     0 reg images.                                                                           
                    WARNING  no regularization images / 正則化画像が見つかりませんでした                             
                    INFO     [Dataset 0]                                                                             
                               batch_size: 3
                               resolution: (512, 512)
                               enable_bucket: False
                               network_multiplier: 1.0

                               [Subset 0 of Dataset 0]
                                 image_dir: "D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img\25_vivid object"
                                 image_count: 11
                                 num_repeats: 25
                                 shuffle_caption: False
                                 keep_tokens: 0
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: vivid object
                                 caption_extension: .txt

                    INFO     [Dataset 0]                                                                             
                    INFO     loading image sizes.                                                                     
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<?, ?it/s]
                    INFO     prepare dataset                                                                          
                    INFO     preparing accelerator                                                                 
[W socket.cpp:663] [c10d] The client socket has failed to connect to [LideTower]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "C:\Users\hampu\kohya_ss\sd-scripts\", line 1115, in <module>
  File "C:\Users\hampu\kohya_ss\sd-scripts\", line 226, in train
    accelerator = train_util.prepare_accelerator(args)
  File "C:\Users\hampu\kohya_ss\sd-scripts\library\", line 4305, in prepare_accelerator
    accelerator = Accelerator(
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\", line 371, in __init__
    self.state = AcceleratorState(
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\", line 758, in __init__
    PartialState(cpu, **kwargs)
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\", line 217, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2024-05-10 14:47:24,541] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 23600) of binary: C:\Users\hampu\kohya_ss\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\hampu\kohya_ss\venv\Scripts\accelerate.EXE\", line 7, in <module>
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\commands\", line 47, in main
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\commands\", line 1008, in launch_command
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\commands\", line 666, in multi_gpu_launcher
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\", line 797, in run
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\", line 264, in launch_agent
    raise ChildFailedError(
C:/Users/hampu/kohya_ss/sd-scripts/ FAILED
Root Cause (first observed failure):
  time      : 2024-05-10_14:47:24
  host      : LideTower
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 23600)
  error_file: <N/A>
  traceback : To enable traceback see:
14:47:25-151738 INFO     Training has ended.
Keyboard interruption in main thread... closing server.

bmaltais commented 1 month ago

No idea... but it appears to be complaining about socket connections. Perhaps some kind of antivirus is causing network access issues?

avan06 commented 1 month ago

I saw in the logs that torch is complaining about the error message "Distributed package doesn't have NCCL built in." Could this be related to CUDA not being installed correctly?
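On the NCCL message: as far as I know, the official PyTorch builds for Windows do not include NCCL, so a multi-GPU launch that asks for the `nccl` backend can fail this way even when CUDA itself is working. The sketch below only reports which distributed backend the installed torch could use. `pick_backend` is a hypothetical helper name, and the script assumes it is run with the venv's Python.

```python
import importlib.util
from typing import Optional


def pick_backend() -> Optional[str]:
    """Return 'nccl' or 'gloo' depending on what torch supports, or None."""
    # Bail out early if torch is not installed for this interpreter.
    if importlib.util.find_spec("torch") is None:
        return None
    import torch.distributed as dist

    # Some builds ship without the distributed package at all.
    if not dist.is_available():
        return None
    return "nccl" if dist.is_nccl_available() else "gloo"


if __name__ == "__main__":
    print("usable distributed backend:", pick_backend())
```

If this prints `gloo` on a machine with two GPUs, training on a single GPU (so that no process group is created) is one thing worth trying before reinstalling drivers.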


lideborg commented 1 month ago

Thanks for the follow-up here. I'll try reinstalling the drivers and see if that gets it going.