bmaltais / kohya_ss

Apache License 2.0

An error occurs when training on multiple GPUs #2014

Closed DDXDB closed 5 months ago

DDXDB commented 7 months ago

My PC:

Windows 11 23H2 (22631.3155)
CPU: R5 5600X
GPU: Arc A770 (16GB) & Arc A750
RAM: DDR4 3600MHz (8+8+8+8GB)

There is no problem when running on a single GPU. Running on multiple GPUs reports the following error:

Active code page: 65001
22:24:49-110073 INFO     headless: False
22:24:49-110073 INFO     Load CSS...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
22:25:04-376168 INFO     Loading config...
F:\kohya_ss\venv\lib\site-packages\gradio\components\dropdown.py:231: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include:  or set allow_custom_value=True.
  warnings.warn(
F:\kohya_ss\venv\lib\site-packages\gradio\components\checkbox.py:105: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Checkbox(...)` instead of `return gr.Checkbox.update(...)`.
  warnings.warn(
F:\kohya_ss\venv\lib\site-packages\gradio\components\textbox.py:163: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Textbox(...)` instead of `return gr.Textbox.update(...)`.
  warnings.warn(
F:\kohya_ss\venv\lib\site-packages\gradio\components\button.py:89: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Button(...)` instead of `return gr.Button.update(...)`.
  warnings.warn(
22:25:05-962271 INFO     Start training LoRA Standard ...
22:25:05-963292 INFO     Checking for duplicate image filenames in training data directory...
22:25:05-965847 INFO     Valid image folder names found in: F:/LoRA_Test/ip_image
22:25:05-968009 INFO     Folder 10_bin: 140 images found
22:25:05-968519 INFO     Folder 10_bin: 1400 steps
22:25:05-969543 INFO     Total steps: 1400
22:25:05-970575 INFO     Train batch size: 1
22:25:05-971087 INFO     Gradient accumulation steps: 1
22:25:05-971600 INFO     Epoch: 15
22:25:05-972602 INFO     Regulatization factor: 1
22:25:05-973603 INFO     max_train_steps (1400 / 1 / 1 * 15 * 1) = 21000
22:25:05-974603 INFO     stop_text_encoder_training = 0
22:25:05-975605 INFO     lr_warmup_steps = 2100
22:25:05-976605 INFO     Saving training config to F:/LoRA_Test/op_model\MuvD_AV5R_V2_20240227-222505.json...
22:25:05-978607 INFO     accelerate launch --gpu_ids="0,1" --multi_gpu --num_processes=2 --num_cpu_threads_per_process=2
                         "./train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents
                         --caption_extension=".txt" --clip_skip=2 --enable_bucket --min_bucket_reso=256
                         --max_bucket_reso=2048 --learning_rate="0.0001" --logging_dir="F:/LoRA_Test/op_logs"
                         --lr_scheduler="cosine_with_restarts" --lr_scheduler_num_cycles="15" --lr_warmup_steps="2100"
                         --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="512,512"
                         --max_train_steps="21000" --mixed_precision="bf16" --network_alpha="128" --network_dim=128
                         --network_module=networks.lora --optimizer_type="AdamW" --output_dir="F:/LoRA_Test/op_model"
                         --output_name="MuvD_AV5R_V2"
                         --pretrained_model_name_or_path="F:/Arc-AI-v2.0.3/models/Stable-diffusion/v1-5-pruned-emaonly.s
                         afetensors" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="fp16"
                         --text_encoder_lr=5e-05 --train_batch_size="1" --train_data_dir="F:/LoRA_Test/ip_image"
                         --unet_lr=0.0001 --sdpa
F:\kohya_ss\venv\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[2024-02-27 22:25:10,720] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
2024-02-27 22:25:12,781 - root - INFO - Using nproc_per_node=2.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
F:\kohya_ss\venv\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
F:\kohya_ss\venv\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
prepare tokenizer
prepare tokenizer
Using DreamBooth method.
prepare images.
found directory F:\LoRA_Test\ip_image\10_bin contains 140 image files
1400 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  network_multiplier: 1.0
  min_bucket_reso: 256
  max_bucket_reso: 2048
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "F:\LoRA_Test\ip_image\10_bin"
    image_count: 140
    num_repeats: 10
    shuffle_caption: False
    keep_tokens: 0
    keep_tokens_separator:
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: bin
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████| 140/140 [00:00<00:00, 4441.10it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 512), count: 1400
mean ar error (without repeats): 2.8560004569601536e-06
preparing accelerator
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "F:\kohya_ss\train_network.py", line 1033, in <module>
    trainer.train(args)
  File "F:\kohya_ss\train_network.py", line 221, in train
    accelerator = train_util.prepare_accelerator(args)
  File "F:\kohya_ss\library\train_util.py", line 3915, in prepare_accelerator
    accelerator = Accelerator(
  File "F:\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
  File "F:\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 758, in __init__
    PartialState(cpu, **kwargs)
  File "F:\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 217, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "F:\kohya_ss\venv\lib\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "F:\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "F:\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
Using DreamBooth method.
prepare images.
found directory F:\LoRA_Test\ip_image\10_bin contains 140 image files
1400 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  network_multiplier: 1.0
  min_bucket_reso: 256
  max_bucket_reso: 2048
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "F:\LoRA_Test\ip_image\10_bin"
    image_count: 140
    num_repeats: 10
    shuffle_caption: False
    keep_tokens: 0
    keep_tokens_separator:
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: bin
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████| 140/140 [00:00<00:00, 4480.10it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 512), count: 1400
mean ar error (without repeats): 2.8560004569601536e-06
preparing accelerator
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "F:\kohya_ss\train_network.py", line 1033, in <module>
    trainer.train(args)
  File "F:\kohya_ss\train_network.py", line 221, in train
    accelerator = train_util.prepare_accelerator(args)
  File "F:\kohya_ss\library\train_util.py", line 3915, in prepare_accelerator
    accelerator = Accelerator(
  File "F:\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
  File "F:\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 758, in __init__
    PartialState(cpu, **kwargs)
  File "F:\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 217, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "F:\kohya_ss\venv\lib\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "F:\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "F:\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2024-02-27 22:25:22,812] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5584) of binary: F:\kohya_ss\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Users\98440\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\98440\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "F:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "F:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "F:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "F:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "F:\kohya_ss\venv\lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "F:\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "F:\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_network.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-02-27_22:25:22
  host      : IFPC-V4
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 7808)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-27_22:25:22
  host      : IFPC-V4
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5584)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I installed kohya_ss in the following way:

git clone https://github.com/bmaltais/kohya_ss.git 

cd kohya_ss 

python -m venv venv 

.\venv\Scripts\Activate.ps1 

pip install dpcpp-cpp-rt mkl-dpcpp 

pip install -r requirements.txt 

pip install https://github.com/Nuullll/intel-extension-for-pytorch/releases/download/v2.1.10%2Bxpu/intel_extension_for_pytorch-2.1.10+xpu-cp310-cp310-win_amd64.whl https://github.com/Nuullll/intel-extension-for-pytorch/releases/download/v2.1.10%2Bxpu/torch-2.1.0a0+cxx11.abi-cp310-cp310-win_amd64.whl https://github.com/Nuullll/intel-extension-for-pytorch/releases/download/v2.1.10%2Bxpu/torchaudio-2.1.0a0+cxx11.abi-cp310-cp310-win_amd64.whl https://github.com/Nuullll/intel-extension-for-pytorch/releases/download/v2.1.10%2Bxpu/torchvision-0.16.0a0+cxx11.abi-cp310-cp310-win_amd64.whl 

pip install tensorboard==2.14.1 tensorflow==2.14.0 

.\setup.ps1 --use-ipex 

Then, under menu item 4 "(Optional) Manually configure accelerate", I answered:

*This machine
*No distributed training
no
yes
no
no
all
bf16

Disty0 commented 7 months ago

DDP / Multi GPU is not supported with IPEX.
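
The traceback above ends in "Distributed package doesn't have NCCL built in", which is raised when `torch.distributed.init_process_group` is asked for the NCCL backend. As a rough way to see what the installed build offers, the sketch below queries `torch.distributed`. The function name `distributed_backends` is made up for illustration, and the result depends entirely on the local PyTorch install.

```python
def distributed_backends():
    """Report which torch.distributed backends the local PyTorch build offers.

    Hypothetical helper for illustration; not part of kohya_ss or accelerate.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False}

    if not dist.is_available():
        # The build was compiled without the distributed package.
        return {"torch_installed": True, "distributed": False}

    return {
        "torch_installed": True,
        "distributed": True,
        "nccl": dist.is_nccl_available(),  # backend named in the traceback
        "gloo": dist.is_gloo_available(),
    }


if __name__ == "__main__":
    print(distributed_backends())
```

If `nccl` comes back `False`, a launch with `--multi_gpu` that requests NCCL would be expected to fail the same way as in the log.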

Specify a device in the training parameters, or try setting one of these environment variables:

For IPEX:

xpu_VISIBLE_DEVICES=0

For anything that uses SYCL / OneAPI including IPEX:

ONEAPI_DEVICE_SELECTOR="ext_oneapi_level_zero_gpu:0"
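
One way to apply the variable suggested above to a single training run, without changing the system settings, is to build a copy of the environment and pass it to the child process. This is a minimal sketch: `single_device_env` is a hypothetical helper, and the selector value is copied from the comment above without being verified against the oneAPI documentation.

```python
import os


def single_device_env(selector="ext_oneapi_level_zero_gpu:0", base=None):
    """Return a copy of the environment with ONEAPI_DEVICE_SELECTOR set.

    Hypothetical helper; the default selector string is the one quoted
    in this thread and is an assumption, not a verified value.
    """
    env = dict(os.environ if base is None else base)
    env["ONEAPI_DEVICE_SELECTOR"] = selector
    return env


if __name__ == "__main__":
    env = single_device_env()
    print(env["ONEAPI_DEVICE_SELECTOR"])
    # The dict can then be handed to a launcher, for example:
    # subprocess.run(["accelerate", "launch", "./train_network.py"], env=env)
```

Passing a copy keeps the change local to the launched process, so other tools started from the same shell still see every device.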