Open · yisaaier opened this issue 9 months ago
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed
An old issue. This is obviously training on Windows; change NCCL to GLOO and it will work. There is a very handy thing called a search engine, it goes by the name "Baidu". Go look up how to change nccl to gloo~
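The advice above can be sketched as follows. This is a minimal illustration, not code from lora-scripts itself: the official PyTorch wheels for Windows ship without NCCL, so `torch.distributed` has to be told to use the GLOO backend instead. The helper names here are hypothetical.

```python
# Hedged sketch of the NCCL -> GLOO switch for Windows, where the
# official PyTorch builds have no NCCL support.

def pick_backend(nccl_available: bool) -> str:
    # NCCL exists only in Linux builds of torch; GLOO works on Windows too.
    return "nccl" if nccl_available else "gloo"

def init_distributed(rank: int, world_size: int) -> None:
    import torch.distributed as dist  # lazy import; torch assumed installed
    dist.init_process_group(
        backend=pick_backend(dist.is_nccl_available()),
        rank=rank,
        world_size=world_size,
    )
```

`dist.is_nccl_available()` is a real torch API and returns False on Windows builds, so the fallback picks GLOO automatically.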
Changed NCCL to GLOO
No more errors, but only one GPU is doing any work
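For reference, a common reason only one GPU works after switching to GLOO is that each spawned process is not pinned to its own device: torchrun/accelerate export a distinct `LOCAL_RANK` per process, and if every rank ends up on the default device, one card trains while the rest idle. A minimal sketch of the idea (the helper name is hypothetical):

```python
# Hedged sketch: each distributed process should bind its own CUDA
# device from LOCAL_RANK; otherwise all ranks pile onto cuda:0.
from typing import Optional

def resolve_device(local_rank_env: Optional[str]) -> str:
    # torchrun/accelerate export LOCAL_RANK per spawned process; without
    # it every process falls back to the same default device.
    if local_rank_env is None:
        return "cuda:0"
    return f"cuda:{int(local_rank_env)}"

# In a training process this would be used roughly as:
#   import os, torch
#   torch.cuda.set_device(resolve_device(os.environ.get("LOCAL_RANK")))
```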
ok, solved, thanks
Fixed, but I want to sort out the NCCL situation somehow, so don't close this issue yet
Is 秋叶 planning a script to switch to gloo on Windows? If you want to tackle NCCL on Windows head-on, a few days ago I saw a decent hands-on tutorial, untranslated Japanese though.
reference https://www.kkaneko.jp/tools/win/nccl.html
Following that guide, the DLL can indeed be built; definitely worth a try. But one question: how should the compiled DLL be deployed so that torch actually picks it up? And what other dependencies need to be added? It feels like a lot of work. GLOO has only two of them usable, but... it should... be enough. As for me, a newbie, I chose to give up instead: long live WSL!
Planning to use gloo
Hi, I ran into another error
NOTE: Redirects are currently not supported in Windows or MacOs.
Using RTX 3090 or 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\accelerate\commands\launch.py", line 1027, in
I modified the train.ps1 file before running. It runs when multi_gpu is 0, but after changing it to 2 this happens. What should I do?
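For context, the multi_gpu switch in train.ps1 ultimately becomes flags on `accelerate launch`; a hedged reconstruction of the equivalent command (script and config paths are taken from the logs in this thread, not verified against train.ps1) looks like:

```shell
# Hedged sketch of the launcher call behind train.ps1 with multi-GPU on.
# --num_processes should equal the GPU count (2 here). Note that the
# NCCL -> GLOO change happens in the Python code, not on this line.
accelerate launch --multi_gpu --num_processes 2 \
  ./sd-scripts/train_network.py \
  --config_file ./config/autosave/20231208-180749.toml
```

`--multi_gpu` and `--num_processes` are real `accelerate launch` flags; with `--num_processes 2` the launcher spawns two worker processes, which is why the backend then matters.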
Changed NCCL to GLOO
No more errors, but only one GPU is doing any work
Hello, where exactly do I make this change?
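Judging from the traceback earlier in this thread, one place the change can go is where the Accelerator is constructed (sd-scripts/library/train_util.py, `prepare_accelerator`). A hedged sketch, assuming your installed accelerate version's `InitProcessGroupKwargs` accepts a `backend` field (newer releases do; older ones may not):

```python
# Hedged sketch: route backend="gloo" into torch.distributed through
# accelerate's kwargs handler. Whether InitProcessGroupKwargs has a
# `backend` field depends on the accelerate version (assumption).
from datetime import timedelta

def gloo_init_kwargs(timeout_minutes: int = 30) -> dict:
    # Plain, torch-free dict mirroring the handler's fields,
    # so the values are easy to inspect and test.
    return {"backend": "gloo", "timeout": timedelta(minutes=timeout_minutes)}

def make_accelerator():
    from accelerate import Accelerator                  # heavy imports kept lazy
    from accelerate.utils import InitProcessGroupKwargs
    handler = InitProcessGroupKwargs(**gloo_init_kwargs())
    return Accelerator(kwargs_handlers=[handler])
```

If your accelerate version predates the `backend` field, the alternative is to patch the `torch.distributed.init_process_group` call site directly, as discussed above.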
18:07:49-364487 INFO Wrote promopts to file D:\ai\lora-scripts-v1.5.1\config\autosave\20231208-180749-promopt.txt
18:07:49-376461 INFO Training started with config file / 训练开始,使用配置文件: D:\ai\lora-scripts-v1.5.1\config\autosave\20231208-180749.toml
18:07:49-392459 INFO Task 7cd32e17-8fdc-45f2-b7d8-e8a68e22c741 created
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-S3F14MD]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-S3F14MD]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Loading settings from D:\ai\lora-scripts-v1.5.1\config\autosave\20231208-180749.toml...
D:\ai\lora-scripts-v1.5.1\config\autosave\20231208-180749
prepare tokenizer
update token length: 255
Using DreamBooth method.
prepare images.
found directory D:\teriri\1_teriri contains 549 image files
549 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 3
  resolution: (1080, 960)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1920
  bucket_reso_steps: 64
  bucket_no_upscale: False

  [Subset 0 of Dataset 0]
    image_dir: "D:\teriri\1_teriri"
    image_count: 549
    num_repeats: 1
    shuffle_caption: True
    keep_tokens: 3
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: teriri
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████| 549/549 [00:00<00:00, 2889.39it/s]
make buckets
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (768, 1216), count: 11
bucket 1: resolution (832, 1152), count: 33
bucket 2: resolution (896, 1088), count: 10
bucket 3: resolution (1024, 960), count: 1
bucket 4: resolution (1088, 896), count: 1
bucket 5: resolution (1152, 832), count: 8
bucket 6: resolution (1216, 768), count: 3
bucket 7: resolution (1280, 768), count: 481
bucket 8: resolution (1344, 704), count: 1
mean ar error (without repeats): 0.09992365997427935
preparing accelerator
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-S3F14MD]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-S3F14MD]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Traceback (most recent call last):
  File "D:\ai\lora-scripts-v1.5.1\sd-scripts\train_network.py", line 1009, in
trainer.train(args)
File "D:\ai\lora-scripts-v1.5.1\sd-scripts\train_network.py", line 216, in train
accelerator = train_util.prepare_accelerator(args)
File "D:\ai\lora-scripts-v1.5.1\sd-scripts\library\train_util.py", line 3784, in prepare_accelerator
accelerator = Accelerator(
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\accelerator.py", line 369, in __init__
self.state = AcceleratorState(
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\state.py", line 732, in __init__
PartialState(cpu, **kwargs)
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\state.py", line 202, in __init__
torch.distributed.init_process_group(backend=self.backend, **kwargs)
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 895, in init_process_group
default_pg = _new_process_group_helper(
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 998, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9608) of binary: D:\ai\lora-scripts-v1.5.1\python\python.exe
Traceback (most recent call last):
File "D:\ai\lora-scripts-v1.5.1\python\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\ai\lora-scripts-v1.5.1\python\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\commands\launch.py", line 996, in
main()
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\commands\launch.py", line 992, in main
launch_command(args)
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\commands\launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\accelerate\commands\launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\ai\lora-scripts-v1.5.1\python\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./sd-scripts/train_network.py FAILED
Failures: