bmaltais / kohya_ss


LoRA training doesn't work anymore and I am too much of a Jerry to understand why, please help #1983

Closed. Esvictum closed this issue 6 months ago.

Esvictum commented 8 months ago

Hello, and welcome to my problem: my program doesn't work anymore, though it did a month ago. Unfortunately, I make changes to my setup with only half of my brain, so here we are.

  1. I upgraded my graphics card from a 3060 Ti to a 4070 OC.
  2. I tried to free up space on my C: drive and deleted some big files (gigabytes' worth) that I thought were temp files related to AI stuff.
  3. I may have updated some things without understanding whether that was OK.

Now, this is what happens when I run my A1111 Kohya DreamBooth LoRA trainer. I am thankful for any suggestions.

```
17:46:52-538662 INFO     Start training LoRA Standard ...
17:46:52-539663 INFO     Checking for duplicate image filenames in training data directory...
17:46:52-543667 INFO     Valid image folder names found in: D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/image
17:46:52-548671 INFO     Folder 100_Spec: 27 images found
17:46:52-548671 INFO     Folder 100_Spec: 2700 steps
17:46:52-549671 INFO     Total steps: 2700
17:46:52-550673 INFO     Train batch size: 1
17:46:52-551674 INFO     Gradient accumulation steps: 1
17:46:52-552674 INFO     Epoch: 1
17:46:52-552674 INFO     Regulatization factor: 1
17:46:52-553675 INFO     max_train_steps (2700 / 1 / 1 * 1 * 1) = 2700
17:46:52-554676 INFO     stop_text_encoder_training = 0
17:46:52-555677 INFO     lr_warmup_steps = 270
17:46:52-556678 INFO     Saving training config to D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/output\spec_20240218-174652.json...
17:46:52-558680 INFO     accelerate launch --num_cpu_threads_per_process=2 "./train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --learning_rate="0.0001" --logging_dir="D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/log" --lr_scheduler="cosine" --lr_scheduler_num_cycles="1" --lr_warmup_steps="270" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="512,512" --max_train_steps="2700" --mixed_precision="fp16" --network_alpha="1" --network_dim=8 --network_module=networks.lora --optimizer_type="AdamW8bit" --output_dir="D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/output" --output_name="pendulum" --pretrained_model_name_or_path="D:/AI Interface/webui/models/Stable-diffusion/realisticVisionV60B1_v60B1VAE.safetensors" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="float" --text_encoder_lr=0.0001 --train_batch_size="1" --train_data_dir="D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/image" --unet_lr=0.0001 --xformers
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
prepare tokenizer
Using DreamBooth method.
prepare images.
found directory D:\AI Zeug\Dump für alte Bilder und Lora source\Lora source Spec\image\100_Spec contains 27 image files
2700 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  network_multiplier: 1.0
  min_bucket_reso: 256
  max_bucket_reso: 2048
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "D:\AI Zeug\Dump für alte Bilder und Lora source\Lora source Spec\image\100_Spec"
    image_count: 27
    num_repeats: 100
    shuffle_caption: False
    keep_tokens: 0
    keep_tokens_separator:
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: Spec
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████| 27/27 [00:00<00:00, 3372.43it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (384, 512), count: 300
bucket 1: resolution (448, 320), count: 900
bucket 2: resolution (640, 384), count: 1500
mean ar error (without repeats): 0.08395061728395048
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: D:/AI Interface/webui/models/Stable-diffusion/realisticVisionV60B1_v60B1VAE.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
Enable xformers for U-Net
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|██████████████████████████████████████████████████████████████████████████████████████████| 27/27 [00:00<?, ?it/s]
caching latents...
100%|██████████████████████████████████████████████████████| 27/27 [00:01<00:00, 16.46it/s]
create LoRA network. base dim (rank): 8, alpha: 1.0
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder:
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
prepare optimizer, data loader etc.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues

binary_path: D:\Kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll
CUDA SETUP: Loading binary D:\Kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll...
use 8-bit AdamW optimizer | {}
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 2700
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 2700
  num epochs / epoch数: 1
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 2700
steps:   0%|          | 0/2700 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Kohya\kohya_ss\train_network.py", line 1033, in <module>
    trainer.train(args)
  File "D:\Kohya\kohya_ss\train_network.py", line 701, in train
    accelerator.init_trackers(
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 619, in _inner
    return PartialState().on_main_process(function)(*args, **kwargs)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 2331, in init_trackers
    tracker_init(project_name, self.logging_dir, **init_kwargs.get(str(tracker), {}))
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\tracking.py", line 79, in execute_on_main_process
    return PartialState().on_main_process(function)(self, *args, **kwargs)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\tracking.py", line 190, in __init__
    self.writer = tensorboard.SummaryWriter(self.logging_dir, **kwargs)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\torch\utils\tensorboard\writer.py", line 243, in __init__
    self._get_file_writer()
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\torch\utils\tensorboard\writer.py", line 273, in _get_file_writer
    self.file_writer = FileWriter(
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\torch\utils\tensorboard\writer.py", line 72, in __init__
    self.event_writer = EventFileWriter(
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\tensorboard\summary\writer\event_file_writer.py", line 72, in __init__
    tf.io.gfile.makedirs(logdir)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 513, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.FailedPreconditionError: D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/log is not a directory
steps:   0%|          | 0/2700 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\Johny\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Johny\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Kohya\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\Kohya\kohya_ss\venv\Scripts\python.exe', './train_network.py', '--bucket_no_upscale', '--bucket_reso_steps=64', '--cache_latents', '--caption_extension=.txt', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--learning_rate=0.0001', '--logging_dir=D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/log', '--lr_scheduler=cosine', '--lr_scheduler_num_cycles=1', '--lr_warmup_steps=270', '--max_data_loader_n_workers=0', '--max_grad_norm=1', '--resolution=512,512', '--max_train_steps=2700', '--mixed_precision=fp16', '--network_alpha=1', '--network_dim=8', '--network_module=networks.lora', '--optimizer_type=AdamW8bit', '--output_dir=D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/output', '--output_name=pendulum', '--pretrained_model_name_or_path=D:/AI Interface/webui/models/Stable-diffusion/realisticVisionV60B1_v60B1VAE.safetensors', '--save_every_n_epochs=1', '--save_model_as=safetensors', '--save_precision=float', '--text_encoder_lr=0.0001', '--train_batch_size=1', '--train_data_dir=D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/image', '--unet_lr=0.0001', '--xformers']' returned non-zero exit status 1.
```

ssokolow commented 8 months ago

Here's how you read those errors:

  1. Look at the bottom. It says "subprocess.CalledProcessError: Command [very long line] returned non-zero exit status 1."
  2. Since it's a `CalledProcessError`, scroll up to above the "Traceback (most recent call last):" line to find the actual error message (a miniature of this two-traceback pattern is sketched below).
  3. The actual error message appears to be this:

```
tensorflow.python.framework.errors_impl.FailedPreconditionError: D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/log is not a directory
```
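For context, that nested shape is what any failure under `accelerate launch` looks like: the child process prints the real traceback first, then the launcher re-raises a generic `CalledProcessError`. A minimal sketch that reproduces the pattern (illustration only, not kohya_ss code):

```python
import subprocess
import sys

# The child fails with the "real" error; check=True makes the launcher
# raise subprocess.CalledProcessError afterwards, just like accelerate does.
subprocess.run(
    [sys.executable, "-c", "raise RuntimeError('the real error')"],
    check=True,
)
```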

Did something happen to your "D:\AI Zeug\Dump für alte Bilder und Lora source\Lora source Spec\log"?
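If the folder really is gone (the cleanup in step 2 of the original post is a likely culprit), recreating it before launching should clear this particular error. A minimal sketch, using the path from the log above:

```python
from pathlib import Path

# Path copied from the training log above.
log_dir = Path(r"D:\AI Zeug\Dump für alte Bilder und Lora source\Lora source Spec\log")

# Create the folder (and any missing parents); a no-op if it already exists.
log_dir.mkdir(parents=True, exist_ok=True)
print(log_dir.is_dir())  # expect True before retrying the run
```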

5KilosOfCheese commented 8 months ago

Make sure the log directory exists. Then see if this helps: https://github.com/bmaltais/kohya_ss/discussions/1744. Since you are using German, the "ü" can cause issues; try replacing it. I know that "Ä", "Ö", and "Å", for example, can cause directory errors with bitsandbytes.
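If you want to do the replacement in one go, something along these lines would rename the offending folder (hypothetical target name; the folder must exist and not be open in another program):

```python
import os

# Hypothetical rename: swap the umlaut for "ue" in the folder name.
os.rename(r"D:\AI Zeug\Dump für alte Bilder und Lora source",
          r"D:\AI Zeug\Dump fuer alte Bilder und Lora source")
```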

ssokolow commented 8 months ago

> Make sure the log directory exists. Then see if this helps: #1744. Since you are using German, the "ü" can cause issues; try replacing it. I know that "Ä", "Ö", and "Å", for example, can cause directory errors with bitsandbytes.

Good point. Whenever a problem involving paths occurs, it's generally a good idea to rule that out by switching to paths containing only characters from the POSIX Portable Filename Character Set (ASCII alphanumerics, `.`, `_`, and `-`). (e.g., I've had LyX projects refuse to render because LaTeX modules weren't properly tested with paths containing spaces.)
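A minimal sketch of such a check (Windows paths assumed; `unportable_parts` is a made-up helper, not a library function):

```python
import re
from pathlib import Path

# POSIX Portable Filename Character Set: ASCII letters, digits, '.', '_', '-'
PORTABLE = re.compile(r"[A-Za-z0-9._-]+")

def unportable_parts(path: str) -> list[str]:
    """Return path components containing characters outside the portable set."""
    parts = Path(path).parts
    # Skip the drive anchor ("D:\\"), which legitimately contains ':' and '\'.
    return [p for p in parts[1:] if not PORTABLE.fullmatch(p)]

print(unportable_parts(r"D:\AI Zeug\Dump für alte Bilder und Lora source\Lora source Spec\log"))
# Every component containing a space or an 'ü' is flagged.
```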

Esvictum commented 8 months ago

I can't believe it. I removed the "ü" and now it just works... after reinstalling everything for hours. I would never have guessed that umlauts are considered evil in the computer and AI world. Thank you very much.

ssokolow commented 8 months ago

It's more that some programs and programming languages haven't been properly tested with Unicode, so they have trouble with anything outside the set of characters in ASCII (which are common to almost all legacy encoding systems, bar a few like Shift-JIS).

In Kohya's case, it's probably just that Python has some legacy baggage that makes it easy to accidentally do Unicode wrong. (Source: I've been coding in Python since I was a teenager.)
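A concrete illustration of that baggage at the bytes level (this isn't literally what TensorFlow does here, but it's the same class of mistake):

```python
path = "D:/AI Zeug/Dump für alte Bilder und Lora source/Lora source Spec/log"

utf8 = path.encode("utf-8")     # how Python 3 normally encodes the path
legacy = path.encode("cp1252")  # how a legacy Windows code page encodes it

# 'ü' is 0xC3 0xBC in UTF-8 but 0xFC in cp1252; code that encodes with one
# and decodes with the other ends up looking for a path that doesn't exist.
print(utf8 == legacy)  # False
```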

waynespa commented 8 months ago

@Esvictum If your issue is resolved, would you consider closing this issue?