aigc-apps / sd-webui-EasyPhoto

📷 EasyPhoto | Your Smart AI Photo Generator.
Apache License 2.0

[No error with single-GPU training] Training error: Failed to obtain Lora after training, please check the training process. #62

Closed tankzhangying closed 1 year ago

tankzhangying commented 1 year ago

0it [00:00, ?it/s]
2023-09-11 15:31:47,037 - modelscope - WARNING - task skin-retouching-torch input definition is missing
2023-09-11 15:31:47,737 - modelscope - WARNING - task skin-retouching-torch output keys are missing
4it [00:03, 1.29it/s]
sh: 2: accelerate: not found

bubbliiiing commented 1 year ago

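In the next release we will make sure that the accelerate environment used at runtime matches the Python environment, which should eliminate this problem. It should be fixed by the end of today.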
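The `sh: 2: accelerate: not found` message reported above means the shell could not find an `accelerate` executable on its PATH. One way to keep the launcher and the Python environment consistent is to run the launcher as a module of the current interpreter. The following is a minimal sketch of that idea, not EasyPhoto's actual fix; `build_launch_command`, `accelerate_on_path`, and the script name are hypothetical and used only for illustration.

```python
import shutil
import sys


def build_launch_command(script, script_args=()):
    """Build a command that runs accelerate's launcher with the current interpreter.

    Hypothetical helper for illustration; not part of EasyPhoto or accelerate.
    """
    # `python -m accelerate.commands.launch` resolves accelerate from the same
    # environment as `sys.executable`, so it does not depend on PATH.
    return [sys.executable, "-m", "accelerate.commands.launch", script, *script_args]


def accelerate_on_path():
    """Return True if a bare `accelerate` executable is visible on PATH."""
    return shutil.which("accelerate") is not None


if __name__ == "__main__":
    print("accelerate on PATH:", accelerate_on_path())
    print(build_launch_command("train_lora.py"))
```

The resulting list can be passed to `subprocess.run`. If `accelerate_on_path()` returns False while `import accelerate` works, the package is installed but its console script is not on the PATH seen by the shell, which would match the error in this report.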

tankzhangying commented 1 year ago

Traceback (most recent call last):
  File "/home/t8/.conda/envs/py310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/t8/.conda/envs/py310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/ai/t8/webui-easyphoto/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 989, in <module>
    main()
  File "/ai/t8/webui-easyphoto/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in main
    launch_command(args)
  File "/ai/t8/webui-easyphoto/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/ai/t8/webui-easyphoto/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher

It is still an accelerate problem.

bubbliiiing commented 1 year ago

Is there still a problem? The last several versions should already keep the accelerate environment consistent with the Python environment.

fightingman1 commented 1 year ago

I am using Docker and still get this error.

fightingman1 commented 1 year ago

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1820 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1823 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 1818) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 989, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 985, in main
    launch_command(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/workspace/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora.py FAILED

fightingman1 commented 1 year ago

Training on a single GPU seems to work.

wuziheng commented 1 year ago

Multi-GPU training indeed does not have explicit official support yet. For now, only single-GPU training is supported.
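Since only single-GPU training is supported for now, one possible workaround on a multi-GPU machine is to expose just one GPU to the process that starts training. The following is a minimal sketch, assuming GPU index 0 is the one to use; `single_gpu_env` is a hypothetical helper and not part of EasyPhoto. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for limiting which devices a process can see.

```python
import os


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment that exposes only one GPU.

    Hypothetical helper for illustration; not part of EasyPhoto.
    """
    # Copy so the caller's environment mapping is left unchanged.
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


if __name__ == "__main__":
    env = single_gpu_env(0)
    print(env["CUDA_VISIBLE_DEVICES"])
    # Pass `env=env` to subprocess.run(...) when starting the web UI or training.
```

With only one device visible, the launcher would be expected to take the single-process path rather than `multi_gpu_launcher`, unless an accelerate configuration explicitly requests multi-GPU. This is an assumption and has not been verified against EasyPhoto.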