bmaltais / kohya_ss

Apache License 2.0

Multi GPU training not working #2646

Open Rodmuzik opened 1 month ago

Rodmuzik commented 1 month ago

I'm using two A6000 GPUs with NVLink for training on Windows 11.

[Screenshot: 2024-07-16, 8:03:45 PM]

But it shows this error:

[2024-07-16 20:04:51,022] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 255, in launch_agent
    result = agent.run()
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 736, in run
    result = self._invoke_run(role)
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 871, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "D:\kohya_ss\venv\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 54, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: unmatched '}' in format string
20:04:54-280439 INFO Training has ended.

[Screenshot: 2024-07-16, 8:05:40 PM]

plz help, thx a lot!!! : )

Rodmuzik commented 1 month ago

Can anyone help? : )

quzopl commented 1 month ago

Here is the answer, and use a Linux system.
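
For context on the "use Linux" suggestion, here is a minimal sketch of how the operating system limits the choice of distributed backend. It assumes that PyTorch's Windows builds ship without NCCL, so only gloo is available there. The helper name `pick_distributed_backend` is hypothetical and is not part of kohya_ss, accelerate, or PyTorch. It does not fix the `TCPStore` error in the traceback above; it only shows the platform check one might run before attempting a multi-GPU launch.

```python
import platform


def pick_distributed_backend(system: str, gpu_count: int) -> str:
    """Hypothetical helper: suggest a torch.distributed backend name.

    Assumption: NCCL is only usable on Linux; on other systems
    gloo is the fallback for multi-process training.
    """
    if gpu_count < 2:
        # A single GPU (or none) needs no distributed backend at all.
        return "none"
    if system == "Linux":
        return "nccl"
    # Windows and macOS: NCCL is assumed to be unavailable.
    return "gloo"


if __name__ == "__main__":
    # platform.system() returns e.g. "Windows", "Linux", or "Darwin".
    current = platform.system()
    print(current, pick_distributed_backend(current, gpu_count=2))
```

For the setup in this thread (Windows 11, two GPUs) the helper returns "gloo", which is one reason multi-GPU training is usually run on Linux.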