Akegarasu / lora-scripts

LoRA & Dreambooth training scripts & GUI using kohya-ss's trainer, for diffusion models.
GNU Affero General Public License v3.0

Dreambooth multi-GPU training error #257

Closed: pzzmyc closed this issue 11 months ago

pzzmyc commented 1 year ago

The environment is Linux, and the error output is shown below. Could someone advise what the cause is? The full error log follows:

18:26:48-573275 INFO     Training started with config file / 训练开始,使用配置文件:                               
                         /mnt/e/lora-scripts-gui/config/autosave/20231008-182648.toml                              
18:26:48-579113 INFO     Task 2ca5857f-14b2-4316-a479-ef22088c4238 created                                         
Loading settings from /mnt/e/lora-scripts-gui/config/autosave/20231008-182648.toml...
/mnt/e/lora-scripts-gui/config/autosave/20231008-182648
prepare tokenizer
update token length: 255
prepare images.
found directory /mnt/e/monstertrain/30_monster contains 11161 image files
334830 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 3
  resolution: (768, 768)
  enable_bucket: True
  min_bucket_reso: 64
  max_bucket_reso: 2048
  bucket_reso_steps: 64
  bucket_no_upscale: False

  [Subset 0 of Dataset 0]
    image_dir: "/mnt/e/monstertrain/30_monster"
    image_count: 11161
    num_repeats: 30
    shuffle_caption: True
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: monster
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|███████████████████████████████████████████████████████████████████████| 11161/11161 [00:14<00:00, 764.51it/s]
make buckets
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (384, 1344), count: 450
bucket 1: resolution (384, 1408), count: 120
bucket 2: resolution (384, 1472), count: 120
bucket 3: resolution (384, 1536), count: 240
bucket 4: resolution (448, 1216), count: 2220
bucket 5: resolution (448, 1280), count: 1080
bucket 6: resolution (512, 1088), count: 9000
bucket 7: resolution (512, 1152), count: 5400
bucket 8: resolution (576, 960), count: 19950
bucket 9: resolution (576, 1024), count: 13800
bucket 10: resolution (640, 896), count: 41220
bucket 11: resolution (704, 832), count: 50820
bucket 12: resolution (768, 768), count: 54210
bucket 13: resolution (832, 704), count: 45390
bucket 14: resolution (896, 640), count: 35310
bucket 15: resolution (960, 576), count: 14880
bucket 16: resolution (1024, 576), count: 12840
bucket 17: resolution (1088, 512), count: 8940
bucket 18: resolution (1152, 512), count: 6360
bucket 19: resolution (1216, 448), count: 4740
bucket 20: resolution (1280, 448), count: 2550
bucket 21: resolution (1344, 384), count: 3180
bucket 22: resolution (1408, 384), count: 630
bucket 23: resolution (1472, 384), count: 390
bucket 24: resolution (1536, 384), count: 540
bucket 25: resolution (1600, 320), count: 180
bucket 26: resolution (1728, 320), count: 60
bucket 27: resolution (1792, 320), count: 30
bucket 28: resolution (1856, 256), count: 30
bucket 29: resolution (1984, 256), count: 30
bucket 30: resolution (2048, 256), count: 120
mean ar error (without repeats): 0.04531900909823176
prepare accelerator
loading model for process 0/2
load StableDiffusion checkpoint: /mnt/e/stable-diffusion-webui/models/Stable-diffusion/GCM-Game Concept Map_v2.0.1.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
loading model for process 1/2
load StableDiffusion checkpoint: /mnt/e/stable-diffusion-webui/models/Stable-diffusion/GCM-Game Concept Map_v2.0.1.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
Enable xformers for U-Net
[Dataset 0]
caching latents.
checking cache validity...
100%|███████████████████████████████████████████████████████████████████| 11161/11161 [00:00<00:00, 1255501.45it/s]
100%|███████████████████████████████████████████████████████████████████████| 11161/11161 [01:37<00:00, 113.95it/s]
caching latents...
100%|██████████████████████████████████████████████████████████████████████████| 1733/1733 [02:17<00:00, 12.62it/s]
Text Encoder is not trained.
prepare optimizer, data loader etc.
use 8-bit AdamW optimizer | {}

Traceback (most recent call last):
  File "/mnt/e/lora-scripts-gui/./sd-scripts/train_db.py", line 488, in <module>
    train(args)
  File "/mnt/e/lora-scripts-gui/./sd-scripts/train_db.py", line 171, in train
    _, _, optimizer = train_util.get_optimizer(args, trainable_params)
  File "/mnt/e/lora-scripts-gui/sd-scripts/library/train_util.py", line 3455, in get_optimizer
    optimizer = optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)
  File "/home/hhy/.local/lib/python3.10/site-packages/bitsandbytes/optim/adamw.py", line 17, in __init__
    super().__init__( "adam", params, lr, betas, eps, weight_decay, 8, args, min_8bit_size, percentile_clipping, block_wise, is_paged=is_paged )
  File "/home/hhy/.local/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 361, in __init__
    super().__init__(params, defaults, optim_bits, is_paged)
  File "/home/hhy/.local/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 96, in __init__
    super().__init__(params, defaults)
  File "/home/hhy/.local/lib/python3.10/site-packages/torch/optim/optimizer.py", line 187, in __init__
    raise ValueError("optimizer got an empty parameter list")
ValueError: optimizer got an empty parameter list
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5131) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/hhy/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 996, in <module>
    main()
  File "/home/hhy/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 992, in main
    launch_command(args)
  File "/home/hhy/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/home/hhy/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/hhy/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/hhy/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hhy/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./sd-scripts/train_db.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-10-08_18:32:00
  host      : DESKTOP-3M8V87N.
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 5132)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-08_18:32:00
  host      : DESKTOP-3M8V87N.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5131)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
18:32:02-126772 ERROR    Training failed / 训练失败 
Akegarasu commented 11 months ago

Known issue; it happens on Windows.

pzzmyc commented 11 months ago

> Known issue; it happens on Windows.

It's not Windows; the system is Ubuntu, and Windows has no NCCL support anyway. Strangely, Dreambooth multi-GPU training has recently started working again.