使用目录内的 python 进行启动....
18:50:23-866005 INFO Windows Python 3.10.11 D:\lora-scripts-v1.6.2\python\python.exe
18:50:23-873005 INFO detected locale zh_CN, use pip mirrors
18:50:26-535903 INFO Torch 2.0.0+cu118
Torch backend: nVidia CUDA 11.8 cuDNN 8700
Torch detected GPU: NVIDIA GeForce RTX 3060 Laptop GPU VRAM 12287 Arch (8, 6) Cores 30
Torch detected GPU: Tesla P40 VRAM 22945 Arch (6, 1) Cores 30
18:50:26-562887 INFO Starting tensorboard...
18:50:26-646802 INFO Server started at http://127.0.0.1:28000
TensorBoard 2.10.1 at http://127.0.0.1:6006/ (Press CTRL+C to quit)
18:55:10-742662 WARNING No subdir found in data dir
18:55:10-746256 WARNING No leagal dataset found. Try find avaliable images
18:55:10-751251 INFO 30 images found, 0 captions found
18:55:10-767242 INFO Auto dataset created D:/lora-scripts-v1.6.2/train/5_zkz\5_zkz
18:55:10-773240 INFO Training started with config file / 训练开始,使用配置文件:
D:\lora-scripts-v1.6.2\config\autosave\20231012-185510.toml
18:55:10-780262 INFO Task 702ef635-db27-4db8-91e3-6c2a2b7404d5 created
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [USER-20230706TY]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [USER-20230706TY]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Loading settings from D:\lora-scripts-v1.6.2\config\autosave\20231012-185510.toml...
D:\lora-scripts-v1.6.2\config\autosave\20231012-185510
prepare tokenizers
Downloading (…)olve/main/vocab.json: 100%|██████████████████████████████████████████| 961k/961k [00:56<00:00, 17.1kB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████████████████████████████████████| 525k/525k [00:31<00:00, 16.6kB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████████| 389/389 [00:00<?, ?B/s]
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████| 905/905 [00:00<?, ?B/s]
update token length: 255
Using DreamBooth method.
prepare images.
found directory D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz contains 30 image files
No caption file found for 30 images. Training will continue without captions for these images. If class token exists, it will be used. / 30枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を 続行します。class tokenが存在する場合はそれを使います。
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00024-2750517853.png
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00025-2750517854.png
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00026-2750517855.png
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00027-4117931403.png
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00028-2719553094.png
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00029-1670813856.png... and 25 more
150 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 1
resolution: (256, 256)
enable_bucket: True
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 32
bucket_no_upscale: False
[Subset 0 of Dataset 0]
image_dir: "D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz"
image_count: 30
num_repeats: 5
shuffle_caption: True
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: zkz
caption_extension: .txt
[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 3002.51it/s]
make buckets
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (224, 288), count: 150
mean ar error (without repeats): 0.02777777777777779
clip_skip will be unexpected / SDXL学習ではclip_skipは動作しません
preparing accelerator
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [USER-20230706TY]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [USER-20230706TY]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Traceback (most recent call last):
File "D:\lora-scripts-v1.6.2\sd-scripts\sdxl_train_network.py", line 183, in
trainer.train(args)
File "D:\lora-scripts-v1.6.2\sd-scripts\train_network.py", line 216, in train
accelerator = train_util.prepare_accelerator(args)
File "D:\lora-scripts-v1.6.2\sd-scripts\library\train_util.py", line 3784, in prepare_accelerator
accelerator = Accelerator(
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\accelerator.py", line 369, in init
self.state = AcceleratorState(
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\state.py", line 732, in init
PartialState(cpu, kwargs)
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\state.py", line 202, in init
torch.distributed.init_process_group(backend=self.backend, kwargs)
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 895, in init_process_group
default_pg = _new_process_group_helper(
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 998, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13140) of binary: D:\lora-scripts-v1.6.2\python\python.exe
Traceback (most recent call last):
File "D:\lora-scripts-v1.6.2\python\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\lora-scripts-v1.6.2\python\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\commands\launch.py", line 996, in
main()
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\commands\launch.py", line 992, in main
launch_command(args)
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\commands\launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\commands\launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\torch\distributed\launcher\api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\lora-scripts-v1.6.2\python\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./sd-scripts/sdxl_train_network.py FAILED
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-10-12_18:57:06
host : USER-20230706TY
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 13140)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
18:57:07-236772 ERROR Training failed / 训练失败
How do I get the multi-GPU training parameters to take effect? Also, is it possible to select a GPU the way the Qiuye (秋叶) launcher does? I want to train on GPU 1, the Tesla P40; how do I set that up?
The full error log is shown above.
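For context on the root cause in the log ("Distributed package doesn't have NCCL built in"): official Windows builds of PyTorch do not include NCCL, so a multi-GPU launch that defaults to the NCCL backend fails on Windows (gloo is the distributed backend available there). For training on only the Tesla P40, one common approach, sketched here under the assumption that the CUDA device indices match the order printed at startup (index 1 = P40), is to hide the other GPU via `CUDA_VISIBLE_DEVICES` before CUDA is initialized:

```python
import os

# Hypothetical sketch: expose only GPU 1 (the Tesla P40 in this log) to the
# training process. This must be set before PyTorch initializes CUDA;
# inside the process the P40 is then enumerated as cuda:0, and a
# single-GPU launch avoids the NCCL code path entirely.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Equivalent from a Windows cmd shell, before running the launch script:
#   set CUDA_VISIBLE_DEVICES=1
```

The index refers to CUDA's enumeration order, which may not match the order shown in the startup log on every system; if in doubt, verify which device is visible before training.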