bmaltais / kohya_ss

Apache License 2.0

DreamBooth checkpoint save error #2610

Closed: iselestia closed this issue 1 week ago

iselestia commented 1 week ago

I tried to train my first checkpoint on the SD 1.5 base. Training started fine with no errors and all steps completed, but after completion this error appears and the model is not saved.

til.py:925
100%|████████████████████| 243/243 [00:00<?, ?it/s]
INFO caching latents... train_util.py:962
100%|████████████████████| 243/243 [01:50<00:00, 2.19it/s]
prepare optimizer, data loader etc.
2024-06-22 16:35:00 INFO use 8-bit AdamW optimizer | {} train_util.py:3621
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 5805
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 5805
  num epochs / epoch数: 1
  batch size per device / バッチサイズ: 1
  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 1
  gradient ccumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 5805
steps: 0%| | 0/5805 [00:00<?, ?it/s]
epoch 1/1
steps: 100%|████████████████████| 5805/5805 [7:27:54<00:00, 4.63s/it, avr_loss=0.149]
Traceback (most recent call last):
  File "C:\Users\iselestia\Desktop\kohya_ss-23.0.14\sd-scripts\train_db.py", line 504, in <module>
    train(args)
  File "C:\Users\iselestia\Desktop\kohya_ss-23.0.14\sd-scripts\train_db.py", line 454, in train
    train_util.save_sd_model_on_train_end(
  File "C:\Users\iselestia\Desktop\kohya_ss-23.0.14\sd-scripts\library\train_util.py", line 4557, in save_sd_model_on_train_end
    save_sd_model_on_train_end_common(
  File "C:\Users\iselestia\Desktop\kohya_ss-23.0.14\sd-scripts\library\train_util.py", line 4574, in save_sd_model_on_train_end_common
    os.makedirs(args.output_dir, exist_ok=True)
  File "C:\Users\iselestia\AppData\Local\Programs\Python\Python310\lib\os.py", line 210, in makedirs
    head, tail = path.split(name)
  File "C:\Users\iselestia\AppData\Local\Programs\Python\Python310\lib\ntpath.py", line 211, in split
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
steps: 100%|████████████████████| 5805/5805 [7:27:54<00:00, 4.63s/it, avr_loss=0.149]
Traceback (most recent call last):
  File "C:\Users\iselestia\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\iselestia\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\iselestia\Desktop\kohya_ss-23.0.14\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\iselestia\Desktop\kohya_ss-23.0.14\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\iselestia\Desktop\kohya_ss-23.0.14\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\iselestia\Desktop\kohya_ss-23.0.14\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\iselestia\Desktop\kohya_ss-23.0.14\venv\Scripts\python.exe', 'C:\Users\iselestia\Desktop\kohya_ss-23.0.14/sd-scripts/train_db.py', '--bucket_no_upscale', '--bucket_reso_steps=64', '--cache_latents', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--learning_rate=1e-05', '--learning_rate_te=1e-05', '--lr_scheduler=cosine', '--lr_scheduler_num_cycles=1', '--lr_warmup_steps=580', '--max_data_loader_n_workers=0', '--resolution=512,768', '--max_train_steps=5805', '--mixed_precision=fp16', '--optimizer_type=AdamW8bit', '--output_name=last', '--pretrained_model_name_or_path=C:/_SD/stable-diffusion-webui-forge-main/models/Stable-diffusion/YutaMixV13.safetensors', '--save_every_n_epochs=1', '--save_model_as=safetensors', '--save_precision=fp16', '--train_batch_size=1', '--train_data_dir=C:/Users/iselestia/Desktop/skirt', '--xformers']' returned non-zero exit status 1.
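
For reference, the last frame of the first traceback can be reproduced in isolation. `os.makedirs` raises this `TypeError` whenever its path argument is `None`, and the launch command in the log contains no `--output_dir` argument. The snippet below is a minimal sketch, assuming `args.output_dir` ends up as `None` when that argument is absent (an assumption, not confirmed in this thread). `resolve_output_dir` is a hypothetical helper for illustration and is not part of sd-scripts.

```python
import os
import tempfile


def reproduce_error():
    """Call os.makedirs with None, as the last traceback frame effectively does."""
    try:
        os.makedirs(None, exist_ok=True)
    except TypeError as exc:
        return str(exc)
    return None


def resolve_output_dir(output_dir):
    """Hypothetical pre-flight check: fail before training, not after it."""
    if output_dir is None:
        raise ValueError("output_dir is not set; choose an output folder before training")
    os.makedirs(output_dir, exist_ok=True)
    return os.path.abspath(output_dir)


if __name__ == "__main__":
    # Prints the TypeError message produced by os.makedirs(None, ...).
    print(reproduce_error())
    # A valid folder is created and its absolute path is returned.
    with tempfile.TemporaryDirectory() as tmp:
        print(resolve_output_dir(os.path.join(tmp, "model_out")))
```

Running a check like this before a multi-hour run would surface a missing output folder immediately instead of after the final training step.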

bmaltais commented 1 week ago

It could be a few things. The output file path may not be writable because the file is locked. It could also be that there is not enough disk space, or it might be a permission issue. But this is almost certainly related to a condition on your system, as it is not a common issue at all.
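
The conditions listed above (missing write permission, a locked path, too little free space) can be checked before training starts. This is a rough sketch using only the Python standard library. `check_output_dir` and the 4 GiB threshold are assumptions chosen for illustration and are not part of kohya_ss or sd-scripts.

```python
import os
import shutil
import tempfile


def check_output_dir(path, min_free_bytes=4 * 1024**3):
    """Return a list of problems found with an output folder (empty if none)."""
    problems = []
    if not os.path.isdir(path):
        problems.append("directory does not exist")
        return problems
    # Free space on the drive that holds the folder.
    if shutil.disk_usage(path).free < min_free_bytes:
        problems.append("less free space than the requested minimum")
    # Try to create and delete a temporary file to test write access.
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        problems.append(f"cannot write to directory: {exc}")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(check_output_dir(tmp, min_free_bytes=0))
```

An empty list means none of the three conditions was detected for that folder. A checkpoint can be several gigabytes, so the threshold should be adjusted to the model being saved.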

iselestia commented 1 week ago

The suggested fixes aren't working, and I don't know why this happens. LoRA trains without any problems. I tried 10 times with different settings and folders and got the same error every time.

iselestia commented 1 week ago

Reinstallation helped. It seems it really was a one-off bug that arose by chance.