Closed pzzmyc closed 11 months ago
The environment is Linux. The error is below; could someone advise what the cause is? Full error output:
18:26:48-573275 INFO Training started with config file: /mnt/e/lora-scripts-gui/config/autosave/20231008-182648.toml
18:26:48-579113 INFO Task 2ca5857f-14b2-4316-a479-ef22088c4238 created
Loading settings from /mnt/e/lora-scripts-gui/config/autosave/20231008-182648.toml...
prepare tokenizer
update token length: 255
prepare images.
found directory /mnt/e/monstertrain/30_monster contains 11161 image files
334830 train images with repeating.
0 reg images.
no regularization images
[Dataset 0]
  batch_size: 3
  resolution: (768, 768)
  enable_bucket: True
  min_bucket_reso: 64
  max_bucket_reso: 2048
  bucket_reso_steps: 64
  bucket_no_upscale: False
  [Subset 0 of Dataset 0]
    image_dir: "/mnt/e/monstertrain/30_monster"
    image_count: 11161
    num_repeats: 30
    shuffle_caption: True
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1, token_warmup_step: 0
    is_reg: False
    class_tokens: monster
    caption_extension: .txt
[Dataset 0] loading image sizes.
100%|████████████| 11161/11161 [00:14<00:00, 764.51it/s]
make buckets
number of images (including repeats)
  bucket 0: resolution (384, 1344), count: 450
  bucket 1: resolution (384, 1408), count: 120
  bucket 2: resolution (384, 1472), count: 120
  bucket 3: resolution (384, 1536), count: 240
  bucket 4: resolution (448, 1216), count: 2220
  bucket 5: resolution (448, 1280), count: 1080
  bucket 6: resolution (512, 1088), count: 9000
  bucket 7: resolution (512, 1152), count: 5400
  bucket 8: resolution (576, 960), count: 19950
  bucket 9: resolution (576, 1024), count: 13800
  bucket 10: resolution (640, 896), count: 41220
  bucket 11: resolution (704, 832), count: 50820
  bucket 12: resolution (768, 768), count: 54210
  bucket 13: resolution (832, 704), count: 45390
  bucket 14: resolution (896, 640), count: 35310
  bucket 15: resolution (960, 576), count: 14880
  bucket 16: resolution (1024, 576), count: 12840
  bucket 17: resolution (1088, 512), count: 8940
  bucket 18: resolution (1152, 512), count: 6360
  bucket 19: resolution (1216, 448), count: 4740
  bucket 20: resolution (1280, 448), count: 2550
  bucket 21: resolution (1344, 384), count: 3180
  bucket 22: resolution (1408, 384), count: 630
  bucket 23: resolution (1472, 384), count: 390
  bucket 24: resolution (1536, 384), count: 540
  bucket 25: resolution (1600, 320), count: 180
  bucket 26: resolution (1728, 320), count: 60
  bucket 27: resolution (1792, 320), count: 30
  bucket 28: resolution (1856, 256), count: 30
  bucket 29: resolution (1984, 256), count: 30
  bucket 30: resolution (2048, 256), count: 120
mean ar error (without repeats): 0.04531900909823176
prepare accelerator
loading model for process 0/2
load StableDiffusion checkpoint: /mnt/e/stable-diffusion-webui/models/Stable-diffusion/GCM-Game Concept Map_v2.0.1.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
loading model for process 1/2
load StableDiffusion checkpoint: /mnt/e/stable-diffusion-webui/models/Stable-diffusion/GCM-Game Concept Map_v2.0.1.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
Enable xformers for U-Net
[Dataset 0] caching latents.
checking cache validity...
100%|████████████| 11161/11161 [00:00<00:00, 1255501.45it/s]
caching latents...
100%|████████████| 1733/1733 [02:17<00:00, 12.62it/s]
Text Encoder is not trained.
prepare optimizer, data loader etc.
use 8-bit AdamW optimizer | {}
Traceback (most recent call last):
  File "/mnt/e/lora-scripts-gui/./sd-scripts/train_db.py", line 488, in <module>
    train(args)
  File "/mnt/e/lora-scripts-gui/./sd-scripts/train_db.py", line 171, in train
    _, _, optimizer = train_util.get_optimizer(args, trainable_params)
  File "/mnt/e/lora-scripts-gui/sd-scripts/library/train_util.py", line 3455, in get_optimizer
    optimizer = optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)
  File "/home/hhy/.local/lib/python3.10/site-packages/bitsandbytes/optim/adamw.py", line 17, in __init__
    super().__init__("adam", params, lr, betas, eps, weight_decay, 8, args, min_8bit_size, percentile_clipping, block_wise, is_paged=is_paged)
  File "/home/hhy/.local/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 361, in __init__
    super().__init__(params, defaults, optim_bits, is_paged)
  File "/home/hhy/.local/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 96, in __init__
    super().__init__(params, defaults)
  File "/home/hhy/.local/lib/python3.10/site-packages/torch/optim/optimizer.py", line 187, in __init__
    raise ValueError("optimizer got an empty parameter list")
ValueError: optimizer got an empty parameter list
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5131) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/hhy/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 996, in <module>
    main()
  File "/home/hhy/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 992, in main
    launch_command(args)
  File "/home/hhy/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/home/hhy/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/hhy/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/hhy/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hhy/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./sd-scripts/train_db.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time       : 2023-10-08_18:32:00
  host       : DESKTOP-3M8V87N.
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 5132)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-10-08_18:32:00
  host       : DESKTOP-3M8V87N.
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 5131)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
18:32:02-126772 ERROR Training failed
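For reference, the `ValueError` at the bottom of the trace comes straight from `torch.optim.Optimizer` when it is handed no trainable parameters. A minimal standalone sketch of the same failure mode (not the training script itself; the toy `Linear` model is just for illustration):

```python
import torch

# Minimal reproduction: when every parameter is frozen (compare the
# "Text Encoder is not trained." line in the log), the list passed to
# the optimizer can end up empty, and torch.optim raises this ValueError.
model = torch.nn.Linear(4, 4)
for p in model.parameters():
    p.requires_grad_(False)  # freeze everything

trainable_params = [p for p in model.parameters() if p.requires_grad]

try:
    torch.optim.AdamW(trainable_params, lr=1e-4)
except ValueError as e:
    print(e)  # optimizer got an empty parameter list
```

So whatever the launcher or config is doing here, by the time `get_optimizer` runs on each rank, `trainable_params` is empty.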
This is a known issue that occurs on Windows.
It's not Windows; the system is Ubuntu (Windows has no NCCL support anyway). Oddly enough, multi-GPU DreamBooth training has recently started working again.
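For anyone hitting this, a quick way to confirm the backend situation on a given machine is stock PyTorch (nothing specific to lora-scripts):

```python
import torch
import torch.distributed as dist

# NCCL is only shipped in the Linux CUDA builds of PyTorch; on Windows
# is_nccl_available() reports False, so multi-GPU runs there would have
# to use the gloo backend instead.
print("NCCL available:", dist.is_nccl_available())
print("gloo available:", dist.is_gloo_available())
print("CUDA available:", torch.cuda.is_available())
```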