aigc-apps / sd-webui-EasyPhoto

📷 EasyPhoto | Your Smart AI Photo Generator.
Apache License 2.0

Training error #244

Open bent1e opened 11 months ago

bent1e commented 11 months ago

Loading ResNet ArcFace
2023-11-08 20:42:35,648 - modelscope - INFO - load face enhancer model done
2023-11-08 20:42:36,212 - modelscope - INFO - load face detector model done
2023-11-08 20:42:36,879 - modelscope - INFO - load sr model done
2023-11-08 20:42:38,379 - modelscope - INFO - load fqa model done
0%| | 0/4 [00:00<?, ?it/s]
2023-11-08 20:42:40,089 - modelscope - WARNING - task skin-retouching-torch input definition is missing
2023-11-08 20:42:42,267 - modelscope - WARNING - task skin-retouching-torch output keys are missing
2023-11-08 20:42:42,286 - modelscope - WARNING - task face_recognition input definition is missing
2023-11-08 20:42:42,734 - modelscope - INFO - model inference done
2023-11-08 20:42:42,735 - modelscope - WARNING - task face_recognition output keys are missing
25%|█████████████████████ | 1/4 [00:04<00:12, 4.31s/it]
2023-11-08 20:42:44,475 - modelscope - INFO - model inference done
50%|██████████████████████████████████████████ | 2/4 [00:06<00:05, 2.80s/it]
2023-11-08 20:42:46,394 - modelscope - INFO - model inference done
75%|███████████████████████████████████████████████████████████████ | 3/4 [00:07<00:02, 2.40s/it]
2023-11-08 20:42:47,603 - modelscope - INFO - model inference done
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00, 2.30s/it]
selected paths: D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\original_backup\3.jpg total scores: 0.7654277769204004 face angles 0.9511747221187343
selected paths: D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\original_backup\2.jpg total scores: 0.7654277769204004 face angles 0.9511747221187343
selected paths: D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\original_backup\1.jpg total scores: 0.6966716841962143 face angles 0.9060542544318123
selected paths: D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\original_backup\0.jpg total scores: 0.6946112547174104 face angles 0.9798203341248489
jpg: 3.jpg face_id_scores 0.7654277769204004
jpg: 2.jpg face_id_scores 0.7654277769204004
jpg: 1.jpg face_id_scores 0.6966716841962143
jpg: 0.jpg face_id_scores 0.6946112547174104
4it [00:02, 1.43it/s]
save processed image to D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\processed_images\train\0.jpg
save processed image to D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\processed_images\train\1.jpg
save processed image to D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\processed_images\train\2.jpg
save processed image to D:\stable-diffusion-webui\outputs/easyphoto-user-id-infos\dan\processed_images\train\3.jpg
train_file_path : D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya/train_lora.py
cache_log_file_path: D:\stable-diffusion-webui\outputs/easyphoto-tmp/train_kohya_log.txt
The following values were not passed to accelerate launch and had defaults used instead:
    --num_processes was set to a value of 2
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in --num_processes=1.
    --num_machines was set to a value of 1
    --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - 在其上下文中,该请求的地址无效。).
2023-11-08 20:43:10,638 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2023-11-08 20:43:10,643 - modelscope - INFO - TensorFlow version 2.14.0 Found.
2023-11-08 20:43:10,643 - modelscope - INFO - Loading ast index from C:\Users\lenovo.cache\modelscope\ast_indexer
2023-11-08 20:43:10,746 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2023-11-08 20:43:10,752 - modelscope - INFO - TensorFlow version 2.14.0 Found.
2023-11-08 20:43:10,752 - modelscope - INFO - Loading ast index from C:\Users\lenovo.cache\modelscope\ast_indexer
2023-11-08 20:43:10,795 - modelscope - INFO - Loading done! Current index file version is 1.9.3, with md5 14287afabd2d935575bf06fb9806e116 and a total number of 943 components indexed
2023-11-08 20:43:10,908 - modelscope - INFO - Loading done! Current index file version is 1.9.3, with md5 14287afabd2d935575bf06fb9806e116 and a total number of 943 components indexed
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Traceback (most recent call last):
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\train_lora.py", line 1467, in <module>
    main()
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\utils\gpu_info.py", line 178, in wrapper
    result = func(*args, **kwargs)
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\train_lora.py", line 826, in main
    accelerator = Accelerator(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\accelerator.py", line 358, in __init__
    self.state = AcceleratorState(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\state.py", line 720, in __init__
    PartialState(cpu, **kwargs)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\state.py", line 192, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [mlopt-workstation]:3456 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Traceback (most recent call last):
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\train_lora.py", line 1467, in <module>
    main()
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\utils\gpu_info.py", line 178, in wrapper
    result = func(*args, **kwargs)
  File "D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\train_lora.py", line 826, in main
    accelerator = Accelerator(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\accelerator.py", line 358, in __init__
    self.state = AcceleratorState(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\state.py", line 720, in __init__
    PartialState(cpu, **kwargs)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\state.py", line 192, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17124) of binary: D:\stable-diffusion-webui\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Users\lenovo\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\lenovo\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\commands\launch.py", line 989, in <module>
    main()
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\commands\launch.py", line 985, in main
    launch_command(args)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\commands\launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\accelerate\commands\launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\stable-diffusion-webui\venv\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya/train_lora.py FAILED

Failures:
[1]:
  time : 2023-11-08_20:43:17
  host : mlopt-workstation
  rank : 1 (local_rank: 1)
  exitcode : 1 (pid: 4936)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time : 2023-11-08_20:43:17
  host : mlopt-workstation
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 17124)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Error executing the command: Command '['D:\stable-diffusion-webui\venv\Scripts\python.exe', '-m', 'accelerate.commands.launch', '--mixed_precision=fp16', '--main_process_port=3456', 'D:\stable-diffusion-webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya/train_lora.py', '--pretrained_model_name_or_path=extensions\sd-webui-EasyPhoto\models\stable-diffusion-v1-5', '--pretrained_model_ckpt=models\Stable-diffusion\sd-v1-4.ckpt', '--train_data_dir=outputs\easyphoto-user-id-infos\dan\processed_images', '--caption_column=text', '--resolution=512', '--random_flip', '--train_batch_size=2', '--gradient_accumulation_steps=4', '--dataloader_num_workers=0', '--max_train_steps=200', '--checkpointing_steps=100', '--learning_rate=0.0001', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--train_text_encoder', '--seed=42', '--rank=128', '--network_alpha=64', '--validation_prompt=easyphoto_face, easyphoto, 1person', '--validation_steps=100', '--output_dir=outputs\easyphoto-user-id-infos\dan\user_weights', '--logging_dir=outputs\easyphoto-user-id-infos\dan\user_weights', '--enable_xformers_memory_efficient_attention', '--mixed_precision=fp16', '--template_dir=extensions\sd-webui-EasyPhoto\models\training_templates', '--template_mask', '--merge_best_lora_based_face_id', '--merge_best_lora_name=dan', '--cache_log_file=D:\stable-diffusion-webui\outputs/easyphoto-tmp/train_kohya_log.txt']' returned non-zero exit status 1.
Using already loaded model sd-v1-4.ckpt [fe4efff1e1]: done in 1.4s (send model to device: 1.4s)

I have two GPUs. Could that be the cause?

wuziheng commented 11 months ago

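Yes, that is the cause. With some versions of accelerate, multi-GPU training crashes. Setting CUDA_VISIBLE_DEVICES=0 (or 1) before running should fix it; I have run into this as well. If it helps, please close the issue.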
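The traceback in the log ends in `RuntimeError: Distributed package doesn't have NCCL built in`, which is raised when the two-process launch tries to set up the NCCL backend. A minimal sketch for checking whether a given PyTorch install has NCCL at all is shown here; the helper name `nccl_status` is hypothetical, and the sketch only reports what the installed build says, it does not change anything.

```python
import importlib.util


def nccl_status() -> str:
    """Report whether the installed PyTorch build can use the NCCL backend.

    `nccl_status` is an illustrative helper, not part of EasyPhoto or accelerate.
    """
    # Avoid an ImportError on machines without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"

    import torch.distributed as dist

    # Some builds ship without the distributed package entirely.
    if not dist.is_available():
        return "torch.distributed unavailable"

    # True only when this build was compiled with NCCL support.
    return "NCCL available" if dist.is_nccl_available() else "NCCL not built in"


if __name__ == "__main__":
    print(nccl_status())
```

If this reports that NCCL is not built in, a multi-GPU launch that requests NCCL would be expected to fail the same way as in the log, which is consistent with the advice to restrict training to a single GPU.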

bent1e commented 11 months ago

Where exactly should I add it? I added it at the top of train_lora.py and it still doesn't work.

wuziheng commented 10 months ago

Add it before everything in the launch environment?
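One way to read this advice: CUDA_VISIBLE_DEVICES has to be in the environment of the process that starts everything else, because child processes inherit it. In the log, accelerate launch had already picked `--num_processes` of 2 before train_lora.py ran, so setting the variable inside train_lora.py would come too late. The sketch below illustrates the inheritance with a stand-in child process; `single_gpu_env` and `launch` are hypothetical helpers, not EasyPhoto or accelerate APIs.

```python
import os
import subprocess
import sys


def single_gpu_env(gpu_index: int = 0) -> dict:
    """Return a copy of the current environment that exposes only one GPU."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch(command: list, gpu_index: int = 0) -> subprocess.CompletedProcess:
    """Run a command with CUDA_VISIBLE_DEVICES already set when it starts."""
    return subprocess.run(
        command,
        env=single_gpu_env(gpu_index),
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Stand-in for the real launcher: the child just prints what it inherited.
    probe = [
        sys.executable,
        "-c",
        "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'))",
    ]
    print(launch(probe, gpu_index=0).stdout.strip())
```

In practice the same effect is usually obtained by setting the variable in the console or launcher script that starts the web UI (on Windows cmd, `set CUDA_VISIBLE_DEVICES=0`), which is an assumption about this setup rather than something confirmed in the thread. The accelerate warning in the log also mentions passing `--num_processes=1` as a way to avoid multi-GPU training.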