lideborg opened 1 month ago
This is odd... it is adding an empty `""` before `launch`, and that is what is causing the issue. What version of the GUI is this? Unless I can reproduce the issue it is hard to fix; I do not observe it on my test system.
What it should look like is D:\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --main_process_port 12345 --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 D:/kohya_ss/sd-scripts/train_db.py --config_file
Is it possible accelerate is not properly installed on your system?
I have added code to the dev branch that will detect when accelerate is not found and will report an error and stop appropriately.
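This is not the actual dev-branch code, just a minimal sketch of what such a check could look like; the `find_accelerate` helper name and the venv layout are my assumptions:

```python
import os
import shutil

def find_accelerate(venv_dir: str):
    """Return the full path to the accelerate executable inside a venv,
    or None if it cannot be found (e.g. accelerate was never installed)."""
    # Windows venvs put console scripts under Scripts\, POSIX venvs under bin/
    scripts_dir = os.path.join(venv_dir, "Scripts" if os.name == "nt" else "bin")
    return shutil.which("accelerate", path=scripts_dir)

# Abort early with a clear error instead of a confusing launch failure later on
if find_accelerate("venv") is None:
    print("ERROR: accelerate executable not found in the venv -- run setup again")
```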
Thanks for the quick response!
Not sure regarding accelerate, what's the easiest way to find out if so?
Regarding specs, I'm on 24.0.9 with dual 3080 Tis.
09:28:17-606548 INFO Kohya_ss GUI version: v24.0.9
09:28:18-444182 INFO Submodule initialized and updated.
09:28:18-447185 INFO nVidia toolkit detected
09:28:31-844650 INFO Torch 2.1.2+cu118
09:28:31-949246 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
09:28:31-954250 INFO Torch detected GPU: NVIDIA GeForce RTX 3080 Ti VRAM 12287 Arch (8, 6) Cores 80
09:28:31-956252 INFO Torch detected GPU: NVIDIA GeForce RTX 3080 Ti VRAM 12288 Arch (8, 6) Cores 80
09:28:31-985275 INFO Python version is 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit
(AMD64)]
09:28:31-987278 INFO Verifying modules installation status from requirements_pytorch_windows.txt...
09:28:31-994285 INFO Verifying modules installation status from requirements_windows.txt...
09:28:32-002292 INFO Verifying modules installation status from requirements.txt...
09:29:11-833674 INFO headless: False
09:29:12-046431 INFO Using shell=True when running external commands...
IMPORTANT: You are using gradio version 4.26.0, however version 4.29.0 is available, please upgrade.
--------
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Try upgrading to the latest release, delete the venv, and run setup again... Maybe this will resolve the missing accelerate issue.
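To answer the earlier question about verifying the install: one quick check (my suggestion, run it with the venv's own python.exe) is to ask the interpreter whether the package is even visible:

```python
import importlib.util

def accelerate_installed() -> bool:
    """True if the accelerate package is visible to this interpreter."""
    return importlib.util.find_spec("accelerate") is not None

print("accelerate installed:", accelerate_installed())
```

If this prints `False` from the venv's python, the reinstall is definitely needed; if `True`, the problem lies elsewhere (e.g. in how the launch command is built).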
Alright, gave it a reinstall and I'm getting closer, but still seeing some failures. Any idea what is happening here?
14:47:08-721334 INFO Start training LoRA Standard ...
14:47:08-722335 INFO Validating lr scheduler arguments...
14:47:08-723839 INFO Validating optimizer arguments...
14:47:08-724842 INFO Validating D:/Dropbox/Work/Feature/09_LoRA/002_vivid\log existence and writability... SUCCESS
14:47:08-725843 INFO Validating D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model existence and writability... SUCCESS
14:47:08-726843 INFO Validating runwayml/stable-diffusion-v1-5 existence... SKIPPING: huggingface.co model
14:47:08-727844 INFO Validating D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img existence... SUCCESS
14:47:08-728845 INFO Folder 25_vivid object: 25 repeats found
14:47:08-729846 INFO Folder 25_vivid object: 11 images found
14:47:08-730847 INFO Folder 25_vivid object: 11 * 25 = 275 steps
14:47:08-731848 INFO Regulatization factor: 1
14:47:08-731848 INFO Total steps: 275
14:47:08-732848 INFO Train batch size: 3
14:47:08-733849 INFO Gradient accumulation steps: 1
14:47:08-734850 INFO Epoch: 10
14:47:08-735852 INFO Max train steps: 950
14:47:08-737856 INFO stop_text_encoder_training = 0
14:47:08-738854 INFO lr_warmup_steps = 0
14:47:08-741855 INFO Saving training config to D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model\vivid_v2_20240510-144708.json...
14:47:08-744860 INFO Executing command: C:\Users\hampu\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default
--mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2
C:/Users/hampu/kohya_ss/sd-scripts/train_network.py --config_file
D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model/config_lora-20240510-144708.toml
14:47:08-749862 INFO Command executed.
[2024-05-10 14:47:12,474] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [LideTower]:29500 (system error: 10049 - The requested address is not valid in its context.).
2024-05-10 14:47:20 INFO Loading settings from train_util.py:3744
D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model/config_lora-20240510-144708.toml...
INFO D:/Dropbox/Work/Feature/09_LoRA/002_vivid\model/config_lora-20240510-144708 train_util.py:3763
2024-05-10 14:47:20 INFO prepare tokenizer train_util.py:4227
INFO update token length: 75 train_util.py:4244
INFO Using DreamBooth method. train_network.py:172
INFO prepare images. train_util.py:1572
INFO found directory D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img\25_vivid object contains 11 image train_util.py:1519
files
INFO 275 train images with repeating. train_util.py:1613
INFO 0 reg images. train_util.py:1616
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1621
INFO [Dataset 0] config_util.py:565
batch_size: 3
resolution: (512, 512)
enable_bucket: False
network_multiplier: 1.0
[Subset 0 of Dataset 0]
image_dir: "D:\Dropbox\Work\Feature\09_LoRA\002_vivid\img\25_vivid object"
image_count: 11
num_repeats: 25
shuffle_caption: False
keep_tokens: 0
keep_tokens_separator:
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: vivid object
caption_extension: .txt
INFO [Dataset 0] config_util.py:571
INFO loading image sizes. train_util.py:853
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<?, ?it/s]
INFO prepare dataset train_util.py:861
INFO preparing accelerator train_network.py:225
[W socket.cpp:663] [c10d] The client socket has failed to connect to [LideTower]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "C:\Users\hampu\kohya_ss\sd-scripts\train_network.py", line 1115, in <module>
trainer.train(args)
File "C:\Users\hampu\kohya_ss\sd-scripts\train_network.py", line 226, in train
accelerator = train_util.prepare_accelerator(args)
File "C:\Users\hampu\kohya_ss\sd-scripts\library\train_util.py", line 4305, in prepare_accelerator
accelerator = Accelerator(
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 371, in __init__
self.state = AcceleratorState(
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 758, in __init__
PartialState(cpu, **kwargs)
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 217, in __init__
torch.distributed.init_process_group(backend=self.backend, **kwargs)
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
default_pg, _ = _new_process_group_helper(
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2024-05-10 14:47:24,541] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 23600) of binary: C:\Users\hampu\kohya_ss\venv\Scripts\python.exe
Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\hampu\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
args.func(args)
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\run.py", line 797, in run
elastic_launch(
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\hampu\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
C:/Users/hampu/kohya_ss/sd-scripts/train_network.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-10_14:47:24
host : LideTower
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 23600)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
14:47:25-151738 INFO Training has ended.
Keyboard interruption in main thread... closing server.
No idea... but it appears to complain about socket connections... perhaps some kind of antivirus is causing network access issues?
I saw in the logs that torch is failing with the error message "Distributed package doesn't have NCCL built in." Could this be related to CUDA not being installed correctly?
Reference: https://discuss.pytorch.org/t/runtimeerror-distributed-package-doesnt-have-nccl-built-in/176744
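My reading of that thread (not a confirmed fix for this setup): Windows builds of PyTorch do not ship NCCL at all, so any multi-GPU distributed launch has to use the gloo backend instead. A quick way to see which backends your torch build actually supports:

```python
import torch.distributed as dist

def available_backends() -> dict:
    """Report which torch.distributed backends this PyTorch build supports."""
    return {
        "nccl": dist.is_nccl_available(),  # GPU collective backend; not built on Windows
        "gloo": dist.is_gloo_available(),  # CPU/fallback backend; works on Windows
    }

print(available_backends())
```

If `nccl` comes back `False` (as it will on Windows), re-running `accelerate config` and selecting a single-GPU setup avoids the distributed init entirely, which may be the simplest workaround here.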
Thanks for the follow-up here. I'll try to reinstall drivers and see if that gets it going.
I've tried to set everything up correctly and think I've followed enough tutorials to be on the right track training a LoRA. But this error keeps coming up when training, both with DreamBooth training and a regular LoRA. Any idea what the issue might be?
Edit: I know 11 images is too few, but right now I'm just trying to get it up and running.
Thanks