Loading ResNet ArcFace
2023-11-08 20:42:35,648 - modelscope - INFO - load face enhancer model done
2023-11-08 20:42:36,212 - modelscope - INFO - load face detector model done
2023-11-08 20:42:36,879 - modelscope - INFO - load sr model done
2023-11-08 20:42:38,379 - modelscope - INFO - load fqa model done
0%| | 0/4 [00:00<?, ?it/s]2023-11-08 20:42:40,089 - modelscope - WARNING - task skin-retouching-torch input definition is missing
2023-11-08 20:42:42,267 - modelscope - WARNING - task skin-retouching-torch output keys are missing
2023-11-08 20:42:42,286 - modelscope - WARNING - task face_recognition input definition is missing
2023-11-08 20:42:42,734 - modelscope - INFO - model inference done
2023-11-08 20:42:42,735 - modelscope - WARNING - task face_recognition output keys are missing
25%|█████████████████████ | 1/4 [00:04<00:12, 4.31s/it]2023-11-08 20:42:44,475 - modelscope - INFO - model inference done
50%|██████████████████████████████████████████ | 2/4 [00:06<00:05, 2.80s/it]2023-11-08 20:42:46,394 - modelscope - INFO - model inference done
75%|███████████████████████████████████████████████████████████████ | 3/4 [00:07<00:02, 2.40s/it]2023-11-08 20:42:47,603 - modelscope - INFO - model inference done
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00, 2.30s/it]
selected paths: D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\original_backup\3.jpg total scores: 0.7654277769204004 face angles 0.9511747221187343
selected paths: D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\original_backup\2.jpg total scores: 0.7654277769204004 face angles 0.9511747221187343
selected paths: D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\original_backup\1.jpg total scores: 0.6966716841962143 face angles 0.9060542544318123
selected paths: D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\original_backup\0.jpg total scores: 0.6946112547174104 face angles 0.9798203341248489
jpg: 3.jpg face_id_scores 0.7654277769204004
jpg: 2.jpg face_id_scores 0.7654277769204004
jpg: 1.jpg face_id_scores 0.6966716841962143
jpg: 0.jpg face_id_scores 0.6946112547174104
4it [00:02, 1.43it/s]
save processed image to D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\processed_images\train\0.jpg
save processed image to D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\processed_images\train\1.jpg
save processed image to D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\processed_images\train\2.jpg
save processed image to D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\processed_images\train\3.jpg
train_file_path : D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya/train_lora.py
cache_log_file_path: D:\stable-diffusion-webui\outputs/easyphoto-tmp/train_kohya_log.txt
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 2
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in --num_processes=1.
--num_machines was set to a value of 1
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - The requested address is not valid in its context.).
2023-11-08 20:43:10,638 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2023-11-08 20:43:10,643 - modelscope - INFO - TensorFlow version 2.14.0 Found.
2023-11-08 20:43:10,643 - modelscope - INFO - Loading ast index from C:\Users\lenovo\.cache\modelscope\ast_indexer
2023-11-08 20:43:10,746 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2023-11-08 20:43:10,752 - modelscope - INFO - TensorFlow version 2.14.0 Found.
2023-11-08 20:43:10,752 - modelscope - INFO - Loading ast index from C:\Users\lenovo\.cache\modelscope\ast_indexer
2023-11-08 20:43:10,795 - modelscope - INFO - Loading done! Current index file version is 1.9.3, with md5 14287afabd2d935575bf06fb9806e116 and a total number of 943 components indexed
2023-11-08 20:43:10,908 - modelscope - INFO - Loading done! Current index file version is 1.9.3, with md5 14287afabd2d935575bf06fb9806e116 and a total number of 943 components indexed
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\train_lora.py", line 1467, in <module>
    main()
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\utils\gpu_info.py", line 178, in wrapper
    result = func(*args, **kwargs)
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\train_lora.py", line 826, in main
    accelerator = Accelerator(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\accelerator.py", line 358, in __init__
    self.state = AcceleratorState(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\state.py", line 720, in __init__
    PartialState(cpu, **kwargs)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\state.py", line 192, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\train_lora.py", line 1467, in <module>
    main()
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\utils\gpu_info.py", line 178, in wrapper
    result = func(*args, **kwargs)
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\train_lora.py", line 826, in main
    accelerator = Accelerator(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\accelerator.py", line 358, in __init__
    self.state = AcceleratorState(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\state.py", line 720, in __init__
    PartialState(cpu, **kwargs)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\state.py", line 192, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17124) of binary: D:\stable-diffusion-webui\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Users\lenovo\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\lenovo\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\commands\launch.py", line 989, in <module>
    main()
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\commands\launch.py", line 985, in main
    launch_command(args)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\commands\launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\commands\launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya/train_lora.py FAILED
Failures:
[1]:
  time      : 2023-11-08_20:43:17
  host      : mlopt-workstation
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 4936)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2023-11-08_20:43:17
  host      : mlopt-workstation
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 17124)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Error executing the command: Command '['D:\stable-diffusion-webui\venv\Scripts\python.exe', '-m', 'accelerate.commands.launch', '--mixed_precision=fp16', '--main_process_port=3456', 'D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya/train_lora.py', '--pretrained_model_name_or_path=extensions\sd-webui-EasyPhoto\models\stable-diffusion-v1-5', '--pretrained_model_ckpt=models\Stable-diffusion\sd-v1-4.ckpt', '--train_data_dir=outputs\easyphoto-user-id-infos\dan\processed_images', '--caption_column=text', '--resolution=512', '--random_flip', '--train_batch_size=2', '--gradient_accumulation_steps=4', '--dataloader_num_workers=0', '--max_train_steps=200', '--checkpointing_steps=100', '--learning_rate=0.0001', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--train_text_encoder', '--seed=42', '--rank=128', '--network_alpha=64', '--validation_prompt=easyphoto_face, easyphoto, 1person', '--validation_steps=100', '--output_dir=outputs\easyphoto-user-id-infos\dan\user_weights', '--logging_dir=outputs\easyphoto-user-id-infos\dan\user_weights', '--enable_xformers_memory_efficient_attention', '--mixed_precision=fp16', '--template_dir=extensions\sd-webui-EasyPhoto\models\training_templates', '--template_mask', '--merge_best_lora_based_face_id', '--merge_best_lora_name=dan', '--cache_log_file=D:\stable-diffusion-webui\outputs/easyphoto-tmp/train_kohya_log.txt']' returned non-zero exit status 1.
Using already loaded model sd-v1-4.ckpt [fe4efff1e1]: done in 1.4s (send model to device: 1.4s)
I have two graphics cards. Is that the cause?
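The accelerate warning in the log names its own workaround: pass `--num_processes=1` so the launcher does not switch to multi-GPU training, which is the path that ends in the NCCL error. Below is a minimal sketch of what that change to the launch command could look like. `force_single_process` is a hypothetical helper, not part of EasyPhoto or accelerate, and I have not checked where the extension actually builds this argument list. It also assumes options are written in the `--flag=value` form seen in the log.

```python
def force_single_process(cmd):
    """Return a copy of an `accelerate.commands.launch` argv with --num_processes=1.

    Hypothetical helper for illustration only. Assumes `--flag=value` style
    options, as in the command printed in the log. Raises ValueError if the
    launcher module name is not present in `cmd`.
    """
    launcher = "accelerate.commands.launch"
    # Drop any existing --num_processes=... option so the new value wins.
    out = [arg for arg in cmd if not arg.startswith("--num_processes=")]
    # Launcher options must come right after the module name, before the script path.
    out.insert(out.index(launcher) + 1, "--num_processes=1")
    return out


if __name__ == "__main__":
    # Shortened version of the failing command from the log.
    cmd = [
        "python.exe", "-m", "accelerate.commands.launch",
        "--mixed_precision=fp16", "--main_process_port=3456",
        "train_lora.py", "--resolution=512",
    ]
    print(force_single_process(cmd))
```

The option goes directly after `accelerate.commands.launch` because anything after the script path is handed to `train_lora.py` rather than to the launcher.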