aigc-apps / sd-webui-EasyPhoto

📷 EasyPhoto | Your Smart AI Photo Generator.
Apache License 2.0

Distributed package doesn't have NCCL built in #204

Open WalkerMe opened 12 months ago

WalkerMe commented 12 months ago
train_file_path :  D:\SD\webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya/train_lora.py
cache_log_file_path: D:\SD\webui\outputs/easyphoto-tmp/train_kohya_log.txt
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [Xxxxx]:3456 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [Xxxxx]:3456 (system error: 10049 - The requested address is not valid in its context.).
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
2023-10-25 05:41:18,772 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2023-10-25 05:41:18,782 - modelscope - INFO - TensorFlow version 2.14.0 Found.
2023-10-25 05:41:18,782 - modelscope - INFO - Loading ast index from C:\Users\Xxxxx\.cache\modelscope\ast_indexer
2023-10-25 05:41:19,444 - modelscope - INFO - Loading done! Current index file version is 1.9.3, with md5 7f80729d76bd3e75c98a156305e0f8df and a total number of 943 components indexed
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [Xxxxx]:3456 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [Xxxxx]:3456 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "D:\SD\webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\train_lora.py", line 1467, in <module>
    main()
  File "D:\SD\webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\utils\gpu_info.py", line 178, in wrapper
    result = func(*args, **kwargs)
  File "D:\SD\webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya\train_lora.py", line 826, in main
    accelerator = Accelerator(
  File "D:\SD\webui\venv\lib\site-packages\accelerate\accelerator.py", line 358, in __init__
    self.state = AcceleratorState(
  File "D:\SD\webui\venv\lib\site-packages\accelerate\state.py", line 720, in __init__
    PartialState(cpu, **kwargs)
  File "D:\SD\webui\venv\lib\site-packages\accelerate\state.py", line 192, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\SD\webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\SD\webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11164) of binary: D:\SD\webui\venv\Scripts\python.exe
Traceback (most recent call last):
  File "runpy.py", line 196, in _run_module_as_main
  File "runpy.py", line 86, in _run_code
  File "D:\SD\webui\venv\lib\site-packages\accelerate\commands\launch.py", line 989, in <module>
    main()
  File "D:\SD\webui\venv\lib\site-packages\accelerate\commands\launch.py", line 985, in main
    launch_command(args)
  File "D:\SD\webui\venv\lib\site-packages\accelerate\commands\launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "D:\SD\webui\venv\lib\site-packages\accelerate\commands\launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\SD\webui\venv\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "D:\SD\webui\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\SD\webui\venv\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
D:\SD\webui\extensions\sd-webui-EasyPhoto\scripts\train_kohya/train_lora.py FAILED

Failed to obtain Lora after training, please check the training process.

The error above appears after I start training.
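For context on the traceback: `init_process_group` was called with the NCCL backend, which Windows builds of PyTorch do not ship, hence the `RuntimeError`. Below is a minimal sketch of how a script could check for NCCL and fall back to Gloo. `pick_backend` and `nccl_is_available` are hypothetical helper names used for illustration; they are not part of EasyPhoto or accelerate. Only `torch.distributed.is_available()` and `torch.distributed.is_nccl_available()` are real PyTorch calls.

```python
import sys


def pick_backend(platform_name: str, nccl_available: bool) -> str:
    """Return a torch.distributed backend name (hypothetical helper).

    NCCL is only chosen when the platform is not Windows and the installed
    PyTorch build reports NCCL support; otherwise fall back to Gloo.
    """
    if platform_name.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


def nccl_is_available() -> bool:
    """Ask the installed PyTorch build whether NCCL is compiled in."""
    try:
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed, so no backend is available at all.
        return False
    # is_available() guards builds compiled without distributed support.
    return dist.is_available() and dist.is_nccl_available()


if __name__ == "__main__":
    print(pick_backend(sys.platform, nccl_is_available()))
```

On a single-GPU machine the simpler route is usually to avoid initializing a process group at all, so treat this only as a diagnostic aid.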


Also, with EasyPhoto enabled, SD cannot exit normally. It reports this error on exit:

C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2829:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit
wuziheng commented 12 months ago

C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2829: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit

This issue can be ignored; it has no effect.

wuziheng commented 12 months ago

Can you confirm that your machine has only one GPU?

WalkerMe commented 12 months ago

Can you confirm that your machine has only one GPU?

I'm running it on my own computer, which has only one discrete GPU.
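The second traceback shows accelerate going through `multi_gpu_launcher` even though only one GPU is present. A rough sketch of building the launch command from the detected GPU count follows. `build_launch_args` and `detect_gpu_count` are hypothetical names, not EasyPhoto's actual code. The sketch also assumes that no saved accelerate config file forces multi-GPU mode, since such a config could still override these flags.

```python
import sys


def build_launch_args(script, gpu_count, script_args=()):
    """Build an `accelerate launch` command line (hypothetical helper).

    `--multi_gpu` is only passed when more than one GPU is visible, so a
    single-GPU machine does not go through the distributed launcher.
    """
    cmd = [sys.executable, "-m", "accelerate.commands.launch"]
    if gpu_count > 1:
        cmd += ["--multi_gpu", "--num_processes", str(gpu_count)]
    else:
        cmd += ["--num_processes", "1"]
    cmd.append(script)
    cmd.extend(script_args)
    return cmd


def detect_gpu_count():
    """Return the number of CUDA devices PyTorch can see (0 without torch)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    # "train_lora.py" stands in for the real training script path.
    print(build_launch_args("train_lora.py", detect_gpu_count()))
```

The resulting list can be passed to `subprocess.run`; printing it first makes it easy to confirm that `--multi_gpu` is absent on a single-GPU machine.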

WalkerMe commented 11 months ago

C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2829: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit

This issue can be ignored; it has no effect.

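I wouldn't say it has no effect. I run SD on a remote machine with a shutdown plugin installed, so that when I'm not using SD I can end the process directly and free its RAM and VRAM. Once this error appears, the process cannot terminate, and the only way to release the resources is to close the cmd window.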