Akegarasu / lora-scripts

LoRA & DreamBooth training scripts & GUI using kohya-ss's trainer, for diffusion models.
GNU Affero General Public License v3.0

How can the multi-GPU training parameters be made to take effect? Also, is it possible to select a GPU like in the 秋叶启动器 (Akegarasu launcher)? I want to train on the Tesla P40 (GPU 1) — how should I set this up? #264

Closed newstargo closed 11 months ago

newstargo commented 11 months ago

How can the multi-GPU training parameters be made to take effect? Also, is it possible to select a GPU like in the 秋叶启动器 (Akegarasu launcher)? I want to train on the Tesla P40 (GPU 1) — how should I set this up?
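A common workaround for pinning training to one card — a sketch, not a built-in lora-scripts v1.6.2 option — is to hide the other GPU from the process via the `CUDA_VISIBLE_DEVICES` environment variable. The device index `1` for the P40 is an assumption taken from the detection order in the log below (index 0 = RTX 3060, index 1 = P40):

```python
import os

# Assumption: the Tesla P40 is CUDA device index 1, matching the detection
# order in the startup log (index 0 = RTX 3060 Laptop GPU, index 1 = P40).
# This must be set BEFORE torch initializes CUDA for the first time.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# From this point on, PyTorch sees exactly one GPU, and the P40 appears
# to the training process as "cuda:0".
```

The same effect can be had from the shell before launching the GUI, e.g. `set CUDA_VISIBLE_DEVICES=1` in cmd or `$env:CUDA_VISIBLE_DEVICES = "1"` in PowerShell.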

The log shows the following error:

Launching with the bundled python....
18:50:23-866005 INFO     Windows Python 3.10.11 D:\lora-scripts-v1.6.2\python\python.exe
18:50:23-873005 INFO     detected locale zh_CN, use pip mirrors
18:50:26-535903 INFO     Torch 2.0.0+cu118
                         Torch backend: nVidia CUDA 11.8 cuDNN 8700
                         Torch detected GPU: NVIDIA GeForce RTX 3060 Laptop GPU VRAM 12287 Arch (8, 6) Cores 30
                         Torch detected GPU: Tesla P40 VRAM 22945 Arch (6, 1) Cores 30
18:50:26-562887 INFO     Starting tensorboard...
18:50:26-646802 INFO     Server started at http://127.0.0.1:28000
TensorBoard 2.10.1 at http://127.0.0.1:6006/ (Press CTRL+C to quit)
18:55:10-742662 WARNING  No subdir found in data dir
18:55:10-746256 WARNING  No leagal dataset found. Try find avaliable images
18:55:10-751251 INFO     30 images found, 0 captions found
18:55:10-767242 INFO     Auto dataset created D:/lora-scripts-v1.6.2/train/5_zkz\5_zkz
18:55:10-773240 INFO     Training started with config file: D:\lora-scripts-v1.6.2\config\autosave\20231012-185510.toml
18:55:10-780262 INFO     Task 702ef635-db27-4db8-91e3-6c2a2b7404d5 created
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [USER-20230706TY]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [USER-20230706TY]:29500 (system error: 10049 - The requested address is not valid in its context.).
Loading settings from D:\lora-scripts-v1.6.2\config\autosave\20231012-185510.toml...
D:\lora-scripts-v1.6.2\config\autosave\20231012-185510
prepare tokenizers
Downloading (…)olve/main/vocab.json: 100%|██████████| 961k/961k [00:56<00:00, 17.1kB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 525k/525k [00:31<00:00, 16.6kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 389/389 [00:00<?, ?B/s]
Downloading (…)okenizer_config.json: 100%|██████████| 905/905 [00:00<?, ?B/s]
update token length: 255
Using DreamBooth method.
prepare images.
found directory D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz contains 30 image files
No caption file found for 30 images. Training will continue without captions for these images. If class token exists, it will be used.
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00024-2750517853.png
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00025-2750517854.png
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00026-2750517855.png
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00027-4117931403.png
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00028-2719553094.png
D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz\00029-1670813856.png... and 25 more
150 train images with repeating.
0 reg images.
no regularization images
[Dataset 0]
  batch_size: 1
  resolution: (256, 256)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 32
  bucket_no_upscale: False

[Subset 0 of Dataset 0]
  image_dir: "D:\lora-scripts-v1.6.2\train\5_zkz\5_zkz"
  image_count: 30
  num_repeats: 5
  shuffle_caption: True
  keep_tokens: 0
  caption_dropout_rate: 0.0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0.0
  caption_prefix: None
  caption_suffix: None
  color_aug: False
  flip_aug: False
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1, token_warmup_step: 0,
  is_reg: False
  class_tokens: zkz
  caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|██████████| 30/30 [00:00<00:00, 3002.51it/s]
make buckets
number of images (including repeats)
bucket 0: resolution (224, 288), count: 150
mean ar error (without repeats): 0.02777777777777779
clip_skip will be unexpected / clip_skip does not work in SDXL training
preparing accelerator
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [USER-20230706TY]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [USER-20230706TY]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "D:\lora-scripts-v1.6.2\sd-scripts\sdxl_train_network.py", line 183, in <module>
    trainer.train(args)
  File "D:\lora-scripts-v1.6.2\sd-scripts\train_network.py", line 216, in train
    accelerator = train_util.prepare_accelerator(args)
  File "D:\lora-scripts-v1.6.2\sd-scripts\library\train_util.py", line 3784, in prepare_accelerator
    accelerator = Accelerator(
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\accelerator.py", line 369, in __init__
    self.state = AcceleratorState(
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\state.py", line 732, in __init__
    PartialState(cpu, **kwargs)
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\state.py", line 202, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 998, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13140) of binary: D:\lora-scripts-v1.6.2\python\python.exe
Traceback (most recent call last):
  File "D:\lora-scripts-v1.6.2\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\lora-scripts-v1.6.2\python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\commands\launch.py", line 996, in <module>
    main()
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\commands\launch.py", line 992, in main
    launch_command(args)
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\commands\launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\accelerate\commands\launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\lora-scripts-v1.6.2\python\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./sd-scripts/sdxl_train_network.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-12_18:57:06
  host      : USER-20230706TY
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13140)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
18:57:07-236772 ERROR    Training failed
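The root cause in the traceback is `RuntimeError: Distributed package doesn't have NCCL built in`: Windows builds of PyTorch do not ship the NCCL backend, so a multi-GPU `accelerate` launch fails as soon as the distributed process group is created. One way around it is to run a single process on one card, so no distributed backend is initialized at all. A sketch of such an `accelerate` config is below — the key names follow the Hugging Face Accelerate config file format and the `gpu_ids` value is an assumption based on the detected device order:

```yaml
# Sketch of an accelerate default_config.yaml fragment (key names per
# Hugging Face Accelerate; verify against your installed version).
compute_environment: LOCAL_MACHINE
distributed_type: "NO"   # single process: the missing NCCL build is never touched
num_processes: 1
gpu_ids: "1"             # assumption: index 1 is the Tesla P40
```

Multi-GPU training on Windows would instead require a backend that is actually compiled in, such as gloo, but that is outside what this log shows.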
Akegarasu commented 11 months ago

https://github.com/Akegarasu/lora-scripts/issues/257