Akegarasu / lora-scripts

LoRA & Dreambooth training scripts & GUI using kohya-ss's trainer, for diffusion models.
GNU Affero General Public License v3.0

Multi-GPU training throws an error #308

Open · yisaaier opened this issue 9 months ago

yisaaier commented 9 months ago

18:07:49-364487 INFO Wrote promopts to file D:\ai\lora-scripts-v1.5.1\config\autosave\20231208-180749-promopt.txt
18:07:49-376461 INFO Training started with config file / 训练开始,使用配置文件: D:\ai\lora-scripts-v1.5.1\config\autosave\20231208-180749.toml
18:07:49-392459 INFO Task 7cd32e17-8fdc-45f2-b7d8-e8a68e22c741 created
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-S3F14MD]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-S3F14MD]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Loading settings from D:\ai\lora-scripts-v1.5.1\config\autosave\20231208-180749.toml...
D:\ai\lora-scripts-v1.5.1\config\autosave\20231208-180749
prepare tokenizer
update token length: 255
Using DreamBooth method.
prepare images.
found directory D:\teriri\1_teriri contains 549 image files
549 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 3
  resolution: (1080, 960)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1920
  bucket_reso_steps: 64
  bucket_no_upscale: False

  [Subset 0 of Dataset 0]
    image_dir: "D:\teriri\1_teriri"
    image_count: 549
    num_repeats: 1
    shuffle_caption: True
    keep_tokens: 3
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: teriri
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████| 549/549 [00:00<00:00, 2889.39it/s]
make buckets
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (768, 1216), count: 11
bucket 1: resolution (832, 1152), count: 33
bucket 2: resolution (896, 1088), count: 10
bucket 3: resolution (1024, 960), count: 1
bucket 4: resolution (1088, 896), count: 1
bucket 5: resolution (1152, 832), count: 8
bucket 6: resolution (1216, 768), count: 3
bucket 7: resolution (1280, 768), count: 481
bucket 8: resolution (1344, 704), count: 1
mean ar error (without repeats): 0.09992365997427935
preparing accelerator
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-S3F14MD]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-S3F14MD]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Traceback (most recent call last):
  File "D:\ai\lora-scripts-v1.5.1\sd-scripts\train_network.py", line 1009, in <module>
    trainer.train(args)
  File "D:\ai\lora-scripts-v1.5.1\sd-scripts\train_network.py", line 216, in train
    accelerator = train_util.prepare_accelerator(args)
  File "D:\ai\lora-scripts-v1.5.1\sd-scripts\library\train_util.py", line 3784, in prepare_accelerator
    accelerator = Accelerator(
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\accelerator.py", line 369, in __init__
    self.state = AcceleratorState(
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\state.py", line 732, in __init__
    PartialState(cpu, **kwargs)
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\state.py", line 202, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 998, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9608) of binary: D:\ai\lora-scripts-v1.5.1\python\python.exe
Traceback (most recent call last):
  File "D:\ai\lora-scripts-v1.5.1\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\ai\lora-scripts-v1.5.1\python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\commands\launch.py", line 996, in <module>
    main()
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\commands\launch.py", line 992, in main
    launch_command(args)
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\commands\launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\commands\launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./sd-scripts/train_network.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-12-08_18:08:27
  host       : DESKTOP-S3F14MD
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 9608)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

18:08:52-511833 ERROR Training failed / 训练失败
jpysina commented 9 months ago

> raise RuntimeError("Distributed package doesn't have NCCL " "built in")
> RuntimeError: Distributed package doesn't have NCCL built in
> ERROR:torch.distributed.elastic.multiprocessing.api:failed

This is an old problem. You are clearly training on Windows; just change NCCL to GLOO and it will work. There is a very handy thing called a search engine, its name is "Baidu", go look up how to change nccl to gloo~
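For reference, the core of the fix is that Windows builds of PyTorch ship without NCCL, so distributed training has to initialize its process group with the gloo backend instead. Below is a minimal, hypothetical sketch of that idea (not the exact patch location used by lora-scripts), assuming the usual env:// setup provided by torchrun or `accelerate launch`:

```python
import torch.distributed as dist

# Windows builds of PyTorch do not include NCCL, so fall back to gloo there.
# Assumes RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are already set by the
# launcher (torchrun or `accelerate launch`), i.e. the default env:// init method.
backend = "nccl" if dist.is_nccl_available() else "gloo"
dist.init_process_group(backend=backend)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} initialized with backend={backend}")
```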

yisaaier commented 9 months ago

> Change NCCL to GLOO

The error is gone now, but only one GPU is doing any work.

jpysina commented 9 months ago

> > Change NCCL to GLOO
>
> The error is gone now, but only one GPU is doing any work.

#301
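(For context: issue #301 covers that remaining step. With accelerate-based launchers the number of worker processes generally has to be set explicitly, e.g. `accelerate launch --multi_gpu --num_processes=2 ...`; otherwise only a single process, and therefore a single GPU, ends up training. In lora-scripts this appears to correspond to the multi_gpu setting in train.ps1 / the GUI mentioned later in this thread.)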

yisaaier commented 9 months ago

> > Change NCCL to GLOO
> >
> > The error is gone now, but only one GPU is doing any work.
>
> #301

OK, that fixed it. Thanks!

Akegarasu commented 9 months ago

Fixed, but I still want to find a proper way to handle the NCCL situation, so please don't close this issue yet.

jpysina commented 9 months ago

> Fixed, but I still want to find a proper way to handle the NCCL situation, so please don't close this issue yet.

Are you planning to have the script switch to gloo on Windows? If you'd rather tackle NCCL on Windows head-on, I came across a decent hands-on tutorial a few days ago (raw Japanese, no translation), reference: https://www.kkaneko.jp/tools/win/nccl.html

Following that guide you can indeed build the DLL files, and it's worth a try. One thing, though: how would the compiled DLL have to be deployed so that torch actually uses it, and what other dependencies still need to be added? It feels like too much hassle overall. Gloo, even though only two of those are usable, should... probably... cover the need. As a newbie I just took the lazy way out: long live WSL!

Akegarasu commented 9 months ago

Planning to use gloo.
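One accelerate-based way this could be wired up, sketched purely as an illustration (this is not the actual lora-scripts patch) and assuming the installed accelerate version's `InitProcessGroupKwargs` accepts a `backend` field (recent releases do; older ones may not):

```python
import platform
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Illustrative sketch: pick gloo on Windows, where NCCL is unavailable.
# Assumption: this accelerate version's InitProcessGroupKwargs exposes `backend`;
# on older versions the backend would have to be overridden elsewhere.
backend = "gloo" if platform.system() == "Windows" else "nccl"
ipg_kwargs = InitProcessGroupKwargs(backend=backend, timeout=timedelta(minutes=30))
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])
```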

CCJetWing commented 9 months ago

Hi, I'm running into a different error:

NOTE: Redirects are currently not supported in Windows or MacOs.
Using RTX 3090 or 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\accelerate\commands\launch.py", line 1027, in <module>
    main()
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\accelerate\commands\launch.py", line 1023, in main
    launch_command(args)
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\run.py", line 786, in run
    elastic_launch(
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\launcher\api.py", line 241, in launch_agent
    result = agent.run()
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 723, in run
    result = self._invoke_run(role)
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: unmatched '}' in format string

I edited the train.ps1 file before running. It works fine when multi_gpu is 0, but after changing it to 2 I get the error above. What should I do?

Huangdebo commented 6 months ago

> > Change NCCL to GLOO
>
> The error is gone now, but only one GPU is doing any work.

Hi, where exactly do I make that change?