bmaltais / kohya_ss

Multi-GPU not working on both Windows and Linux #2484

Open okingjo opened 4 months ago

okingjo commented 4 months ago

Issue: Multi-GPU training has not worked ever since "accelerate launch" was added.

Machines I have tried: several, on both Windows and Linux; the logs below are from a 4x A100 machine and a 6x RTX 3090 machine, both running in Docker.

Accelerate config: It does not seem to matter; the error message is the same no matter whether I configure it for distributed training or not. The config I used to use for multi-GPU runs was "distributed training - yes", "dynamo, deepspeed, etc. - no", the correct number of GPUs, and "all" to use every GPU I have. The error message below was produced under this config.
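For what it's worth, here is a minimal sanity check (my own sketch, not part of kohya_ss; the file name check_accel.py is made up) that can be launched with the same kind of flags the GUI passes, e.g. accelerate launch --multi_gpu --num_processes 4 check_accel.py, to confirm that accelerate itself can spin up one process per GPU outside of kohya_ss:

# check_accel.py - hypothetical sanity-check script, not part of kohya_ss.
# Each spawned rank should print its own CUDA device; if only one line
# appears, the launch is effectively single-GPU.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
print(
    f"rank {accelerator.process_index}/{accelerator.num_processes} "
    f"on {accelerator.device}, cuda available: {torch.cuda.is_available()}"
)
accelerator.wait_for_everyone()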

Version: I am currently on v24.0.3, but this issue has existed ever since accelerate launch was added.

Error message: This is from the 4x A100 machine, running in Docker.

14:45:42-103931 INFO     Start training LoRA Standard ...
14:45:42-105577 INFO     Validating lr scheduler arguments...
14:45:42-106378 INFO     Validating optimizer arguments...
14:45:42-107054 INFO     Validating /dataset/lora/ruanmei/log/ existence and writability... SUCCESS
14:45:42-107750 INFO     Validating /dataset/lora/ruanmei/model/ existence and writability...
                         SUCCESS
14:45:42-108480 INFO     Validating /dataset/base_model/animagine-xl-3.0-base.safetensors
                         existence... SUCCESS
14:45:42-109179 INFO     Validating /dataset/lora/ruanmei/img/ existence... SUCCESS
14:45:42-109829 INFO     Validating /dataset/vae/sdxl_vae.safetensors existence... SUCCESS
14:45:42-110513 INFO     Headless mode, skipping verification if model already exist... if model
                         already exist it will be overwritten...
14:45:42-111409 INFO     Folder 3_ruan_mei_(honkai_star_rail) 1girl: 3 repeats found
14:45:42-112648 INFO     Folder 3_ruan_mei_(honkai_star_rail) 1girl: 230 images found
14:45:42-113435 INFO     Folder 3_ruan_mei_(honkai_star_rail) 1girl: 230 * 3 = 690 steps
14:45:42-114177 INFO     Folder 2_ruan_mei_(honkai_star_rail) 1girl: 2 repeats found
14:45:42-115183 INFO     Folder 2_ruan_mei_(honkai_star_rail) 1girl: 247 images found
14:45:42-115878 INFO     Folder 2_ruan_mei_(honkai_star_rail) 1girl: 247 * 2 = 494 steps
14:45:42-116578 INFO     Folder 5_ruan_mei_(honkai_star_rail) 1girl: 5 repeats found
14:45:42-117377 INFO     Folder 5_ruan_mei_(honkai_star_rail) 1girl: 85 images found
14:45:42-118077 INFO     Folder 5_ruan_mei_(honkai_star_rail) 1girl: 85 * 5 = 425 steps
14:45:42-118774 INFO     Folder 6_ruan_mei_(honkai_star_rail) 1girl: 6 repeats found
14:45:42-119891 INFO     Folder 6_ruan_mei_(honkai_star_rail) 1girl: 89 images found
14:45:42-120726 INFO     Folder 6_ruan_mei_(honkai_star_rail) 1girl: 89 * 6 = 534 steps
14:45:42-121576 INFO     Regulatization factor: 1
14:45:42-122336 INFO     Total steps: 2143
14:45:42-123044 INFO     Train batch size: 2
14:45:42-123756 INFO     Gradient accumulation steps: 1
14:45:42-124479 INFO     Epoch: 20
14:45:42-125167 INFO     max_train_steps (2143 / 2 / 1 * 20 * 1) = 21430
14:45:42-126096 INFO     stop_text_encoder_training = 0
14:45:42-126801 INFO     lr_warmup_steps = 2143
14:45:42-128322 INFO     Saving training config to
                         /dataset/lora/ruanmei/model/Char-HonkaiSR-Ruanmei-XL-V1_20240510-144542.jso
                         n...
14:45:42-129527 INFO     Executing command: /home/1000/.local/bin/accelerate launch --dynamo_backend
                         no --dynamo_mode default --gpu_ids 0,1,2,3 --mixed_precision no --multi_gpu
                         --num_processes 4 --num_machines 1 --num_cpu_threads_per_process 2
                         /app/sd-scripts/sdxl_train_network.py --config_file
                         /dataset/lora/ruanmei/model//config_lora-20240510-144542.toml
14:45:42-131952 INFO     Command executed.
2024-05-10 14:45:48 INFO     Loading settings from                                train_util.py:3744
                             /dataset/lora/ruanmei/model//config_lora-20240510-14
                             4542.toml...
                    INFO     /dataset/lora/ruanmei/model//config_lora-20240510-14 train_util.py:3763
                             4542
2024-05-10 14:45:48 INFO     prepare tokenizers                               sdxl_train_util.py:134
2024-05-10 14:45:48 INFO     Loading settings from                                train_util.py:3744
                             /dataset/lora/ruanmei/model//config_lora-20240510-14
                             4542.toml...
2024-05-10 14:45:48 INFO     Loading settings from                                train_util.py:3744
                             /dataset/lora/ruanmei/model//config_lora-20240510-14
                             4542.toml...
                    INFO     /dataset/lora/ruanmei/model//config_lora-20240510-14 train_util.py:3763
                             4542
                    INFO     /dataset/lora/ruanmei/model//config_lora-20240510-14 train_util.py:3763
                             4542
2024-05-10 14:45:48 INFO     prepare tokenizers                               sdxl_train_util.py:134
2024-05-10 14:45:48 INFO     prepare tokenizers                               sdxl_train_util.py:134
2024-05-10 14:45:49 INFO     Loading settings from                                train_util.py:3744
                             /dataset/lora/ruanmei/model//config_lora-20240510-14
                             4542.toml...
                    INFO     /dataset/lora/ruanmei/model//config_lora-20240510-14 train_util.py:3763
                             4542
2024-05-10 14:45:49 INFO     prepare tokenizers                               sdxl_train_util.py:134
Traceback (most recent call last):
  File "/home/1000/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/home/1000/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "/home/1000/.local/lib/python3.10/site-packages/urllib3/connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "/usr/local/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

The above exception was the direct cause of the following exception:

urllib3.exceptions.ProxyError: ('Unable to connect to proxy', RemoteDisconnected('Remote end closed connection without response'))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/1000/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/1000/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/home/1000/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /openai/clip-vit-large-patch14/resolve/main/tokenizer_config.json (Caused by ProxyError('Unable to connect to proxy', RemoteDisconnected('Remote end closed connection without response')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/sd-scripts/sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "/app/sd-scripts/train_network.py", line 154, in train
    tokenizer = self.load_tokenizer(args)
  File "/app/sd-scripts/sdxl_train_network.py", line 53, in load_tokenizer
    tokenizer = sdxl_train_util.load_tokenizers(args)
  File "/app/sd-scripts/library/sdxl_train_util.py", line 147, in load_tokenizers
    tokenizer = CLIPTokenizer.from_pretrained(original_path)
  File "/home/1000/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1969, in from_pretrained
    resolved_config_file = cached_file(
  File "/home/1000/.local/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1238, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1631, in get_hf_file_metadata
    r = _request_wrapper(
  File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
    response = _request_wrapper(
  File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 408, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
  File "/home/1000/.local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/1000/.local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 67, in send
    return super().send(request, *args, **kwargs)
  File "/home/1000/.local/lib/python3.10/site-packages/requests/adapters.py", line 513, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /openai/clip-vit-large-patch14/resolve/main/tokenizer_config.json (Caused by ProxyError('Unable to connect to proxy', RemoteDisconnected('Remote end closed connection without response')))"), '(Request ID: 1cd0ba3c-9d26-4b30-8881-e46f1ad80288)')
2024-05-10 14:45:49 INFO     update token length: 225                         sdxl_train_util.py:159
                    INFO     Using DreamBooth method.                           train_network.py:172
2024-05-10 14:45:50 INFO     prepare images.                                      train_util.py:1572
                    INFO     found directory                                      train_util.py:1519
                             /dataset/lora/ruanmei/img/3_ruan_mei_(honkai_star_ra
                             il) 1girl contains 230 image files
                    INFO     found directory                                      train_util.py:1519
                             /dataset/lora/ruanmei/img/2_ruan_mei_(honkai_star_ra
                             il) 1girl contains 247 image files
                    INFO     found directory                                      train_util.py:1519
                             /dataset/lora/ruanmei/img/5_ruan_mei_(honkai_star_ra
                             il) 1girl contains 85 image files
                    INFO     found directory                                      train_util.py:1519
                             /dataset/lora/ruanmei/img/6_ruan_mei_(honkai_star_ra
                             il) 1girl contains 89 image files
                    INFO     2143 train images with repeating.                    train_util.py:1613
                    INFO     0 reg images.                                        train_util.py:1616
                    WARNING  no regularization images /                           train_util.py:1621
                             正則化画像が見つかりませんでした
[2024-05-10 14:45:50,157] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 704 closing signal SIGTERM
[2024-05-10 14:45:50,157] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 705 closing signal SIGTERM
[2024-05-10 14:45:50,157] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 706 closing signal SIGTERM
[2024-05-10 14:45:50,223] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 703) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/home/1000/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/app/sd-scripts/sdxl_train_network.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-10_14:45:50
  host      : 813dd376f19c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 703)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
14:45:51-417026 INFO     Training has ended.
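
Note that in the A100 log above, rank 0 dies before training starts because the download of openai/clip-vit-large-patch14 from huggingface.co fails behind the proxy, which then takes the other ranks down. As a sketch of how to rule that part out (my guess at a workaround, not an official fix; it assumes the container has some working network path and that HF_HOME points at a persistent cache), the tokenizer could be pre-cached once before launching:

# Hypothetical one-off pre-caching step, run once in the same container/venv,
# so the spawned ranks can load the tokenizer from the local HF cache instead
# of each hitting huggingface.co through the proxy.
from transformers import CLIPTokenizer

# Same repo id that appears in the failing URL of the traceback above.
CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

Once it is cached, setting TRANSFORMERS_OFFLINE=1 (and HF_HUB_OFFLINE=1) should keep the ranks from reaching the Hub at all, although that alone may not explain the ChildFailedError on the other machine.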

The following is from the 6x RTX 3090 machine, also in Docker. Since the full log exceeds the character limit, I am pasting only the final error message.

accelerator device: cuda:5
2024-05-10 15:02:59 INFO     U-Net: <All keys matched     sdxl_model_util.py:202
                             successfully>
                    INFO     building text encoders       sdxl_model_util.py:205
2024-05-10 15:03:00 INFO     loading text encoders from   sdxl_model_util.py:258
                             checkpoint
                    INFO     text encoder 1: <All keys    sdxl_model_util.py:272
                             matched successfully>
2024-05-10 15:03:02 INFO     text encoder 2: <All keys    sdxl_model_util.py:276
                             matched successfully>
                    INFO     building VAE                 sdxl_model_util.py:279
2024-05-10 15:03:03 INFO     loading VAE from checkpoint  sdxl_model_util.py:284
                    INFO     VAE: <All keys matched       sdxl_model_util.py:287
                             successfully>
                    INFO     load VAE:                        model_util.py:1268
                             /dataset/vae/sdxl_vae.safetensor
                             s
2024-05-10 15:03:04 INFO     additional VAE loaded        sdxl_train_util.py:128
[2024-05-10 15:03:11,271] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 675 closing signal SIGTERM
[2024-05-10 15:03:11,435] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 672) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/home/1000/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
/app/sd-scripts/sdxl_train_network.py FAILED
---------------------------------------------------
Failures:
[1]:
  time      : 2024-05-10_15:03:11
  host      : 507b354d3cab
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 673)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 673
[2]:
  time      : 2024-05-10_15:03:11
  host      : 507b354d3cab
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 674)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 674
[3]:
  time      : 2024-05-10_15:03:11
  host      : 507b354d3cab
  rank      : 4 (local_rank: 4)
  exitcode  : -7 (pid: 676)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 676
[4]:
  time      : 2024-05-10_15:03:11
  host      : 507b354d3cab
  rank      : 5 (local_rank: 5)
  exitcode  : -7 (pid: 677)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 677
---------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-10_15:03:11
  host      : 507b354d3cab
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 672)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 672
===================================================
15:03:13-135365 INFO     Training has ended.

Additional info: Before accelerate launch was introduced to the GUI, multi-GPU worked perfectly: all you needed to do was configure accelerate and it would run smoothly. So I thought it might be a good idea to ignore the accelerate launch options, for example by not checking the multi-GPU checkbox. But I was wrong: it either runs on a single GPU or just errors out. To check whether this is an isolated case, I tried different machines and operating systems, but the error message is always very similar: it is always "torch.distributed.elastic.multiprocessing.errors.ChildFailedError"; the only difference is that the "exitcode : 1 (pid: 703)" numbers may vary.

Please help.

zhchaoxing commented 3 months ago

I am not sure if this is the same issue or a different one, but I am also getting errors when trying multi-GPU SDXL fine-tuning, likewise on a Linux machine, and cannot proceed.

caching latents...
 42%|████████████████████████████████████████████████████████████████████▎                                                                                               | 9588/23008 [27:23<32:18,  6.92it/s][2024-05-19 06:17:40,266] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1242468 closing signal SIGTERM
[2024-05-19 06:17:40,323] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1242469 closing signal SIGTERM
[2024-05-19 06:17:40,327] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1242471 closing signal SIGTERM
[2024-05-19 06:17:42,066] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 2 (pid: 1242470) of binary: /home/ubuntu/train/kohya_ss_22.4.1/venv/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/train/kohya_ss_22.4.1/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
./sdxl_train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-19_06:17:40
  host      : ubuntutrain
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 1242470)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1242470
========================================================