bmaltais / kohya_ss

train error #897

Closed kkklo closed 9 months ago

kkklo commented 1 year ago

To create a public link, set share=True in launch().
02:09:20-964813 INFO SD v2 v_parameterization detected. Setting --v2 parameter and --v_parameterization
02:09:57-177145 INFO Loading config...
02:09:59-898536 INFO SD v2 v_parameterization detected. Setting --v2 parameter and --v_parameterization
02:10:45-025677 INFO Start training LoRA Standard ...
02:10:45-027644 INFO Folder 100_aleng: 141 images found
02:10:45-029639 INFO Folder 100_aleng: 14100 steps
02:10:45-030674 INFO Total steps: 14100
02:10:45-031661 INFO Train batch size: 2
02:10:45-032660 INFO Gradient accumulation steps: 1.0
02:10:45-033635 INFO Epoch: 1
02:10:45-034653 INFO Regulatization factor: 1
02:10:45-035650 INFO max_train_steps (14100 / 2 / 1.0 * 1 * 1) = 7050
02:10:45-036620 INFO stop_text_encoder_training = 0
02:10:45-038644 INFO lr_warmup_steps = 0
02:10:45-039641 INFO accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --v2 --v_parameterization --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" --train_data_dir="F:/Stable Diffusion/aleng/image" --resolution=768,768 --output_dir="F:/Stable Diffusion/aleng/model" --logging_dir="F:/Stable Diffusion/aleng/log" --network_alpha="128" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-05 --unet_lr=0.0001 --network_dim=128 --output_name="aleng" --lr_scheduler_num_cycles="1" --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="2" --max_train_steps="7050" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --seed="1234" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="1" --clip_skip=2 --bucket_reso_steps=64 --mem_eff_attn --gradient_checkpointing --xformers --bucket_no_upscale
[02:10:53] WARNING NOTE: Redirects are currently not supported in Windows or MacOs. (redirects.py:27)
[W ..\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [DESKTOP-MU52LFJ]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [DESKTOP-MU52LFJ]:29500 (system error: 10049 - The requested address is not valid in its context.).
v2 with clip_skip will be unexpected
prepare tokenizer
Using DreamBooth method.
prepare images.
found directory F:\Stable Diffusion\aleng\image\100_aleng contains 141 image files
No caption file found for 42 images. Training will continue without captions for these images. If class token exists, it will be used.
F:\Stable Diffusion\aleng\image\100_aleng\Snipaste_2023-06-03_17-39-46.png
F:\Stable Diffusion\aleng\image\100_aleng\Snipaste_2023-06-03_17-40-11.png
F:\Stable Diffusion\aleng\image\100_aleng\Snipaste_2023-06-03_17-40-26.png
F:\Stable Diffusion\aleng\image\100_aleng\Snipaste_2023-06-03_17-40-35.png
F:\Stable Diffusion\aleng\image\100_aleng\Snipaste_2023-06-03_17-41-23.png
F:\Stable Diffusion\aleng\image\100_aleng\Snipaste_2023-06-03_17-41-47.png
... and 37 more
14100 train images with repeating.
0 reg images.
no regularization images
[Dataset 0]
  batch_size: 2
  resolution: (768, 768)
  enable_bucket: False

[Subset 0 of Dataset 0]
  image_dir: "F:\Stable Diffusion\aleng\image\100_aleng"
  image_count: 141
  num_repeats: 100
  shuffle_caption: False
  keep_tokens: 0
  caption_dropout_rate: 0.0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0.0
  color_aug: False
  flip_aug: False
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1,
  token_warmup_step: 0,
  is_reg: False
  class_tokens: aleng
  caption_extension: .txt

[Dataset 0] loading image sizes.
100%|██████████| 141/141 [00:00<00:00, 9024.69it/s]
prepare dataset
preparing accelerator
F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
  warnings.warn(
[W ..\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [DESKTOP-MU52LFJ]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [DESKTOP-MU52LFJ]:29500 (system error: 10049 - The requested address is not valid in its context.).

Traceback (most recent call last):

F:\Stable Diffusion\kohya_ss\train_network.py:814 in <module>
    811     args = parser.parse_args()
    812     args = train_util.read_config_from_file(args, parser)
    813
  ❱ 814     train(args)
    815

F:\Stable Diffusion\kohya_ss\train_network.py:139 in train
    136
    137     # prepare the accelerator
    138     print("preparing accelerator")
  ❱ 139     accelerator, unwrap_model = train_util.prepare_accelerator(args)
    140     is_main_process = accelerator.is_main_process
    141
    142     # prepare a dtype matching mixed precision and cast as needed

F:\Stable Diffusion\kohya_ss\library\train_util.py:2975 in prepare_accelerator
    2972             if args.wandb_api_key is not None:
    2973                 wandb.login(key=args.wandb_api_key)
    2974
  ❱ 2975     accelerator = Accelerator(
    2976         gradient_accumulation_steps=args.gradient_accumulation_steps,
    2977         mixed_precision=args.mixed_precision,
    2978         log_with=log_with,

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:346 in __init__
    343                         self.fp8_recipe_handler = handler
    344
    345         kwargs = self.init_handler.to_kwargs() if self.init_handler is not None else {}
  ❱ 346         self.state = AcceleratorState(
    347             mixed_precision=mixed_precision,
    348             cpu=cpu,
    349             dynamo_plugin=dynamo_plugin,

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\accelerate\state.py:540 in __init__
    537         if parse_flag_from_env("ACCELERATE_USE_CPU"):
    538             cpu = True
    539         if PartialState._shared_state == {}:
  ❱ 540             PartialState(cpu, **kwargs)
    541         self.__dict__.update(PartialState._shared_state)
    542         self._check_initialized(mixed_precision, cpu)
    543         if not self.initialized:

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\accelerate\state.py:129 in __init__
    126             elif int(os.environ.get("LOCAL_RANK", -1)) != -1 and not cpu:
    127                 self.distributed_type = DistributedType.MULTI_GPU
    128                 if not torch.distributed.is_initialized():
  ❱ 129                     torch.distributed.init_process_group(backend="nccl", **kwargs)
    130                     self.backend = "nccl"
    131                 self.num_processes = torch.distributed.get_world_size()
    132                 self.process_index = torch.distributed.get_rank()

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py:602 in init_process_group
    599             # different systems (e.g. RPC) in case the store is multi-tenant.
    600             store = PrefixStore("default_pg", store)
    601
  ❱ 602         default_pg = _new_process_group_helper(
    603             world_size,
    604             rank,
    605             [],

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py:727 in _new_process_group_helper
    724             _pg_names[pg] = group_name
    725         elif backend == Backend.NCCL:
    726             if not is_nccl_available():
  ❱ 727                 raise RuntimeError("Distributed package doesn't have NCCL " "built in")
    728             if pg_options is not None:
    729                 assert isinstance(
    730                     pg_options, ProcessGroupNCCL.Options

RuntimeError: Distributed package doesn't have NCCL built in
[02:11:04] ERROR failed (exitcode: 1) local_rank: 0 (pid: 5072) of binary: F:\Stable Diffusion\kohya_ss\venv\Scripts\python.exe (api.py:671)

Traceback (most recent call last):

C:\Users\kira\AppData\Local\Programs\Python\Python310\lib\runpy.py:196 in _run_module_as_main
    193     main_globals = sys.modules["__main__"].__dict__
    194     if alter_argv:
    195         sys.argv[0] = mod_spec.origin
  ❱ 196     return _run_code(code, main_globals, None,
    197                     "__main__", mod_spec)
    198
    199 def run_module(mod_name, init_globals=None,

C:\Users\kira\AppData\Local\Programs\Python\Python310\lib\runpy.py:86 in _run_code
    83                     __loader__ = loader,
    84                     __package__ = pkg_name,
    85                     __spec__ = mod_spec)
  ❱ 86     exec(code, run_globals)
    87     return run_globals
    88
    89 def _run_module_code(code, init_globals=None,

in <module>:7
    4 from accelerate.commands.accelerate_cli import main
    5 if __name__ == '__main__':
    6     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
  ❱ 7     sys.exit(main())
    8

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py:45 in main
    42         exit(1)
    43
    44     # Run
  ❱ 45     args.func(args)
    46
    47
    48 if __name__ == "__main__":

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py:914 in launch_command
    911     elif args.use_megatron_lm and not args.cpu:
    912         multi_gpu_launcher(args)
    913     elif args.multi_gpu and not args.cpu:
  ❱ 914         multi_gpu_launcher(args)
    915     elif args.tpu and not args.cpu:
    916         if args.tpu_use_cluster:
    917             tpu_pod_launcher(args)

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py:603 in multi_gpu_launcher
    600     )
    601     with patch_environment(**current_env):
    602         try:
  ❱ 603             distrib_run.run(args)
    604         except Exception:
    605             if is_rich_available() and debug:
    606                 console = get_console()

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\torch\distributed\run.py:752 in run
    749         )
    750
    751     config, cmd, cmd_args = config_from_args(args)
  ❱ 752     elastic_launch(
    753         config=config,
    754         entrypoint=cmd,
    755     )(*cmd_args)

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py:131 in __call__
    128         self._entrypoint = entrypoint
    129
    130     def __call__(self, *args):
  ❱ 131         return launch_agent(self._config, self._entrypoint, list(args))
    132
    133
    134 def _get_entrypoint_name(

F:\Stable Diffusion\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py:245 in launch_agent
    242             # if the error files for the failed children exist
    243             # @record will copy the first error (root cause)
    244             # to the error file of the launcher process.
  ❱ 245             raise ChildFailedError(
    246                 name=entrypoint_name,
    247                 failures=result.failures,
    248             )

ChildFailedError:
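============================================================
train_network.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-04_02:11:04
  host      : DESKTOP-MU52LFJ.lan
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5072)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================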
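
The RuntimeError above is raised when `is_nccl_available()` returns False while the `nccl` backend is requested. A minimal sketch for checking which `torch.distributed` backends a given install reports follows. `choose_backend` and `probe_backends` are hypothetical helper names, not part of kohya_ss or Accelerate. To my knowledge the Windows builds of PyTorch do not include NCCL, which would match this log, but treat that as an assumption and confirm it with the probe.

```python
import platform
from typing import Optional


def choose_backend(nccl_ok: bool, gloo_ok: bool) -> Optional[str]:
    """Hypothetical helper: prefer NCCL, fall back to Gloo, else None."""
    if nccl_ok:
        return "nccl"
    if gloo_ok:
        return "gloo"
    return None


def probe_backends() -> dict:
    """Report which torch.distributed backends this install was built with."""
    info = {"platform": platform.system(), "torch": None,
            "nccl": False, "gloo": False}
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # torch is not installed in this environment; leave defaults.
        return info
    info["torch"] = torch.__version__
    # The per-backend checks are only meaningful when the distributed
    # package itself is available in this build.
    if dist.is_available():
        info["nccl"] = dist.is_nccl_available()
        info["gloo"] = dist.is_gloo_available()
    return info


if __name__ == "__main__":
    info = probe_backends()
    print(info)
    print("usable backend:", choose_backend(info["nccl"], info["gloo"]))
```

Run it with the Python interpreter inside the kohya_ss venv so it inspects the same torch install that the training script uses.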
bmaltais commented 1 year ago

v21.6.5 should fix this.

rdcoder33 commented 1 year ago

> v21.6.5 should fix this.

v21.6.5 of what? accelerate?

oliverban commented 1 year ago

I have this error as well on the most recent version: Torch 2, fresh install, LoRA training.
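
For anyone comparing setups: the traceback in the original post shows `launch_command` entering `multi_gpu_launcher` (the `args.multi_gpu` branch), and `accelerate/state.py` line 126 then choosing `DistributedType.MULTI_GPU` because `LOCAL_RANK` is set. The sketch below only mirrors that quoted condition so you can see whether a process would go down the same path. `would_take_multi_gpu_branch` is a hypothetical name, and the idea that a multi-GPU launch configuration is the trigger here is an inference from the traceback, not something confirmed in this thread.

```python
import os
from typing import Mapping, Optional


def would_take_multi_gpu_branch(env: Optional[Mapping[str, str]] = None,
                                cpu: bool = False) -> bool:
    """Mirror of the check quoted in the traceback (accelerate/state.py:126).

    Returns True when LOCAL_RANK is set to something other than -1 and CPU
    mode is off, which is the case where the quoted code goes on to call
    init_process_group(backend="nccl").
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", -1)) != -1 and not cpu


if __name__ == "__main__":
    print("LOCAL_RANK =", os.environ.get("LOCAL_RANK"))
    print("multi-GPU branch:", would_take_multi_gpu_branch())
```

If this reports True under `accelerate launch` on a single-GPU Windows machine, re-running `accelerate config` and choosing a non-distributed setup may be worth trying.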