bmaltais / kohya_ss


There are a lot of errors that I will try to explain; they come down to training failing and useless sample generation at the wrong resolution #2309

Closed CRCODE22 closed 6 months ago

CRCODE22 commented 6 months ago

epoch 1/10
steps:  10%|█████████████████████████ | 390/3900 [48:00<7:12:07, 7.39s/it, avr_loss=0.132]
2024-04-17 07:13:54 INFO     saving checkpoint: K:/AI/Training/Dataset/test_woman-000001.safetensors   train_util.py:5130
                    INFO     generating sample images at step / サンプル画像生成 ステップ: 390   train_util.py:5131
2024-04-17 07:13:55 INFO     prompt: K:/AI/Training/Dataset/test_woman_v1/model   train_util.py:5284
                    INFO     negative_prompt: None   train_util.py:5285
                    INFO     height: 512   train_util.py:5286
                    INFO     width: 512   train_util.py:5287
                    INFO     sample_steps: 30   train_util.py:5288
                    INFO     scale: 7.5   train_util.py:5289
                    INFO     sample_sampler: dpm_2   train_util.py:5290
K:\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(

epoch 2/10
steps:  20%|██████████████████████████████████████████████████ | 780/3900 [1:38:12<6:32:48, 7.55s/it, avr_loss=0.135]
2024-04-17 08:04:07 INFO     saving checkpoint: K:/AI/Training/Dataset/test_woman-000002.safetensors   train_util.py:5130
                    INFO     generating sample images at step / サンプル画像生成 ステップ: 780   train_util.py:5131
2024-04-17 08:04:08 INFO     prompt: K:/AI/Training/Dataset/test_woman_v1/model   train_util.py:5284
                    INFO     negative_prompt: None   train_util.py:5285
                    INFO     height: 512   train_util.py:5286
                    INFO     width: 512   train_util.py:5287
                    INFO     sample_steps: 30   train_util.py:5288
                    INFO     scale: 7.5   train_util.py:5289
                    INFO     sample_sampler: dpm_2   train_util.py:5290

epoch 3/10
steps:  25%|██████████████████████████████████████████████████████████████ | 980/3900 [2:06:38<6:17:21, 7.75s/it, avr_loss=0.126]
Traceback (most recent call last):
  File "K:\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "K:\kohya_ss\sd-scripts\train_network.py", line 804, in train
    for step, batch in enumerate(train_dataloader):
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\data_loader.py", line 458, in __iter__
    next_batch = next(dataloader_iter)
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 630, in __next__
    data = self._next_data()
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\data\dataset.py", line 302, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "K:\kohya_ss\sd-scripts\library\train_util.py", line 1207, in __getitem__
    img, face_cx, face_cy, face_w, face_h = self.load_image_with_face_info(subset, image_info.absolute_path)
  File "K:\kohya_ss\sd-scripts\library\train_util.py", line 1092, in load_image_with_face_info
    img = load_image(image_path)
  File "K:\kohya_ss\sd-scripts\library\train_util.py", line 2352, in load_image
    img = np.array(image, np.uint8)
  File "K:\kohya_ss\venv\lib\site-packages\PIL\Image.py", line 681, in __array_interface__
    new["data"] = self.tobytes()
  File "K:\kohya_ss\venv\lib\site-packages\PIL\Image.py", line 761, in tobytes
    return b"".join(output)
MemoryError
steps:  25%|██████████████████████████████████████████████████████████████ | 980/3900 [2:06:39<6:17:22, 7.75s/it, avr_loss=0.126]
Traceback (most recent call last):
  File "E:\anaconda3\envs\kohya_ss\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "E:\anaconda3\envs\kohya_ss\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "K:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['K:\kohya_ss\venv\Scripts\python.exe', 'K:/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', './outputs/tmpfilelora.toml']' returned non-zero exit status 1.
08:32:33-254742 INFO Training has ended.

You can see above that the samples are being generated at 512x512, which is wrong, because I have the following in the sample generation prompt:

masterpiece, best quality, (test woman), solo, wearing black leather pants and a red tshirt, futuristic setting, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy, bad composition, poor, low effort --w 768 --h 1024--d 1 --l 4.0 --s 40
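For context, the trailing --n / --w / --h / --d / --l / --s switches on that line are what should drive the sample resolution, seed, CFG scale, and step count rather than the 512x512 defaults shown in the log. A rough sketch of how such a line could be split into prompt text and options (an illustration only, not sd-scripts' actual parser):

```python
# Illustrative only: split a sample-prompt line of the form
# "<prompt> --n <negative> --w <width> --h <height> --d <seed> --l <scale> --s <steps>"
import re

def parse_sample_prompt(line, defaults=None):
    opts = dict(defaults or {"w": "512", "h": "512", "l": "7.5", "s": "30"})
    parts = re.split(r"\s--(\w+)\s+", " " + line)  # [prompt, key, value, key, value, ...]
    opts["prompt"] = parts[0].strip()
    for key, value in zip(parts[1::2], parts[2::2]):
        opts[key] = value.strip()
    return opts

print(parse_sample_prompt("test woman, solo --n low quality --w 768 --h 1024 --d 1 --l 4.0 --s 40"))
# {'w': '768', 'h': '1024', 'l': '4.0', 's': '40', 'prompt': 'test woman, solo', 'n': 'low quality', 'd': '1'}
```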

(attached sample images: Test_Woman_1, Test_Woman_2)

I hope that kohya_ss can be fixed. It was working very well until I updated it, and with all the changes related to security or something it seems to be completely broken; I have been unable to train LoRAs for days now. I have been able to train SDXL LoRAs with kohya_ss for many months, so I know what I am doing when it comes to setting up the training settings. 16 GB of VRAM used to be more than enough and I could even train with a batch size of 4; now even a batch size of 1 uses around 15500 MB of VRAM. I hope @bmaltais can figure out what is responsible for the increased VRAM usage and why kohya_ss appears to be completely broken under Windows 11 Pro, both from the command prompt (cmd) and from the Anaconda Prompt, and I am using the correct Python version etc. in both.
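Separately, the MemoryError in the traceback above surfaces inside PIL's tobytes() while the dataset loader converts an image to a NumPy array. A standalone way to check whether a single training image survives roughly the same PIL-to-NumPy path (the file name below is a hypothetical example, not code from kohya_ss):

```python
# Decode one training image the way the failing call stack does (PIL -> NumPy)
# to rule out a corrupt or enormous file.
import numpy as np
from PIL import Image

path = r"K:\AI\Training\Dataset\test_woman_v1\img\2_testwoman woman\example.jpg"  # hypothetical file
img = np.array(Image.open(path).convert("RGB"), np.uint8)
print(img.shape, f"{img.nbytes / 2**20:.1f} MiB in memory")
```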

CRCODE22 commented 6 months ago

Here is more information from what kohya_ss printed before training started:

06:24:51-478486 INFO Folder 2_testwoman woman: 2 repeats found
06:24:51-480486 INFO Folder 2_testwoman woman: 195 images found
06:24:51-481485 INFO Folder 2_testwoman woman: 195 * 2 = 390 steps
06:24:51-482487 INFO Regulatization factor: 1
06:24:51-483486 INFO Total steps: 390
06:24:51-485486 INFO Train batch size: 1
06:24:51-486486 INFO Gradient accumulation steps: 1
06:24:51-487486 INFO Epoch: 10
06:24:51-489488 INFO max_train_steps (390 / 1 / 1 * 10 * 1) = 3900
06:24:51-490486 INFO stop_text_encoder_training = 0
06:24:51-491491 INFO lr_warmup_steps = 0
06:24:51-496494 INFO Saving training config to K:/AI/Training/Dataset/test_woman_v1/model\test_woman_v1_20240417-062451.json...
06:24:51-499495 INFO Executing command: "K:\kohya_ss\venv\Scripts\accelerate.EXE" launch --dynamo_backend no --dynamo_mode default --gpu_ids 0 --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 "K:/kohya_ss/sd-scripts/sdxl_train_network.py" --config_file "./outputs/tmpfilelora.toml" with shell=True
06:24:51-507494 INFO Command executed.
A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'
2024-04-17 06:25:10 INFO     Loading settings from ./outputs/tmpfilelora.toml...   train_util.py:3744
                    INFO     ./outputs/tmpfilelora   train_util.py:3763
2024-04-17 06:25:10 INFO     prepare tokenizers   sdxl_train_util.py:134
2024-04-17 06:25:11 INFO     update token length: 150   sdxl_train_util.py:159
                    INFO     Using DreamBooth method.   train_network.py:172
2024-04-17 06:25:12 INFO     prepare images.   train_util.py:1572
                    INFO     found directory K:\AI\Training\Dataset\test_woman_v1\img\2_testwoman woman contains 195 image files   train_util.py:1519
                    INFO     390 train images with repeating.   train_util.py:1613
                    INFO     0 reg images.   train_util.py:1616
                    WARNING  no regularization images / 正則化画像が見つかりませんでした   train_util.py:1621
                    INFO     [Dataset 0]   config_util.py:565
                               batch_size: 1
                               resolution: (1280, 1280)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 2048
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                           [Subset 0 of Dataset 0]
                             image_dir: "K:\AI\Training\Dataset\test_woman_v1\img\2_testwoman woman"
                             image_count: 195
                             num_repeats: 2
                             shuffle_caption: False
                             keep_tokens: 0
                             keep_tokens_separator:
                             secondary_separator: None
                             enable_wildcard: False
                             caption_dropout_rate: 0
                             caption_dropout_every_n_epoches: 0
                             caption_tag_dropout_rate: 0.0
                             caption_prefix: None
                             caption_suffix: None
                             color_aug: False
                             flip_aug: False
                             face_crop_aug_range: None
                             random_crop: False
                             token_warmup_min: 1,
                             token_warmup_step: 0,
                             is_reg: False
                             class_tokens: testwoman woman
                             caption_extension: .txt

                INFO     [Dataset 0]   config_util.py:571
                INFO     loading image sizes.   train_util.py:853

100%|███████████████████████████████████████████████████████████████████████████████████████████| 195/195 [00:00<00:00, 7643.11it/s]
                INFO     make buckets   train_util.py:859
                WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます   train_util.py:876
                INFO     number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)   train_util.py:905
                INFO     bucket 0: resolution (320, 1280), count: 2   train_util.py:910
                INFO     bucket 1: resolution (576, 1280), count: 2   train_util.py:910
                INFO     bucket 2: resolution (640, 832), count: 2   train_util.py:910
                INFO     bucket 3: resolution (640, 1280), count: 4   train_util.py:910
                INFO     bucket 4: resolution (704, 1088), count: 2   train_util.py:910
                INFO     bucket 5: resolution (704, 1280), count: 6   train_util.py:910
                INFO     bucket 6: resolution (768, 960), count: 2   train_util.py:910
                INFO     bucket 7: resolution (768, 1280), count: 10   train_util.py:910
                INFO     bucket 8: resolution (768, 1408), count: 4   train_util.py:910
                INFO     bucket 9: resolution (768, 1472), count: 4   train_util.py:910
                INFO     bucket 10: resolution (768, 1600), count: 2   train_util.py:910
                INFO     bucket 11: resolution (832, 1280), count: 30   train_util.py:910
                INFO     bucket 12: resolution (832, 1728), count: 4   train_util.py:910
                INFO     bucket 13: resolution (896, 768), count: 2   train_util.py:910
                INFO     bucket 14: resolution (896, 1280), count: 4   train_util.py:910
                INFO     bucket 15: resolution (896, 1408), count: 2   train_util.py:910
                INFO     bucket 16: resolution (896, 1536), count: 2   train_util.py:910
                INFO     bucket 17: resolution (896, 1600), count: 2   train_util.py:910
                INFO     bucket 18: resolution (960, 640), count: 2   train_util.py:910
                INFO     bucket 19: resolution (960, 1088), count: 2   train_util.py:910
                INFO     bucket 20: resolution (960, 1280), count: 18   train_util.py:910
                INFO     bucket 21: resolution (960, 1408), count: 2   train_util.py:910
                INFO     bucket 22: resolution (960, 1472), count: 6   train_util.py:910
                INFO     bucket 23: resolution (960, 1536), count: 2   train_util.py:910
                INFO     bucket 24: resolution (1024, 1280), count: 26   train_util.py:910
                INFO     bucket 25: resolution (1024, 1344), count: 22   train_util.py:910
                INFO     bucket 26: resolution (1024, 1408), count: 10   train_util.py:910
                INFO     bucket 27: resolution (1024, 1472), count: 4   train_util.py:910
                INFO     bucket 28: resolution (1024, 1536), count: 62   train_util.py:910
                INFO     bucket 29: resolution (1088, 1280), count: 10   train_util.py:910
                INFO     bucket 30: resolution (1088, 1344), count: 30   train_util.py:910
                INFO     bucket 31: resolution (1088, 1408), count: 4   train_util.py:910
                INFO     bucket 32: resolution (1088, 1472), count: 4   train_util.py:910
                INFO     bucket 33: resolution (1152, 832), count: 2   train_util.py:910
                INFO     bucket 34: resolution (1152, 1280), count: 6   train_util.py:910
                INFO     bucket 35: resolution (1152, 1344), count: 4   train_util.py:910
                INFO     bucket 36: resolution (1152, 1408), count: 6   train_util.py:910
                INFO     bucket 37: resolution (1216, 1280), count: 4   train_util.py:910
                INFO     bucket 38: resolution (1280, 768), count: 2   train_util.py:910
                INFO     bucket 39: resolution (1280, 832), count: 2   train_util.py:910
                INFO     bucket 40: resolution (1280, 896), count: 2   train_util.py:910
                INFO     bucket 41: resolution (1280, 960), count: 2   train_util.py:910
                INFO     bucket 42: resolution (1280, 1024), count: 2   train_util.py:910
                INFO     bucket 43: resolution (1280, 1152), count: 6   train_util.py:910
                INFO     bucket 44: resolution (1280, 1216), count: 2   train_util.py:910
                INFO     bucket 45: resolution (1280, 1280), count: 18   train_util.py:910
                INFO     bucket 46: resolution (1344, 1152), count: 4   train_util.py:910
                INFO     bucket 47: resolution (1408, 704), count: 2   train_util.py:910
                INFO     bucket 48: resolution (1408, 960), count: 2   train_util.py:910
                INFO     bucket 49: resolution (1408, 1024), count: 2   train_util.py:910
                INFO     bucket 50: resolution (1408, 1088), count: 2   train_util.py:910
                INFO     bucket 51: resolution (1408, 1152), count: 4   train_util.py:910
                INFO     bucket 52: resolution (1472, 1088), count: 6   train_util.py:910
                INFO     bucket 53: resolution (1536, 832), count: 8   train_util.py:910
                INFO     bucket 54: resolution (1536, 1024), count: 4   train_util.py:910
                INFO     bucket 55: resolution (1664, 960), count: 6   train_util.py:910
                INFO     bucket 56: resolution (1792, 896), count: 2   train_util.py:910
                INFO     mean ar error (without repeats): 0.016963672547761124   train_util.py:915
                WARNING  clip_skip will be unexpected / SDXL学習ではclip_skipは動作しません   sdxl_train_util.py:343
                INFO     preparing accelerator   train_network.py:225
accelerator device: cuda
                INFO     loading model for process 0/1   sdxl_train_util.py:30
                INFO     load StableDiffusion checkpoint: K:/ArtificialIntelligenceModels/models/Stable-diffusion/sd_xl_base_1.0.safetensors   sdxl_train_util.py:70
                INFO     building U-Net   sdxl_model_util.py:192
2024-04-17 06:25:13 INFO     loading U-Net from checkpoint   sdxl_model_util.py:196
2024-04-17 06:25:22 INFO     U-Net:   sdxl_model_util.py:202
                INFO     building text encoders   sdxl_model_util.py:205
                INFO     loading text encoders from checkpoint   sdxl_model_util.py:258
2024-04-17 06:25:23 INFO     text encoder 1:   sdxl_model_util.py:272
2024-04-17 06:25:27 INFO     text encoder 2:   sdxl_model_util.py:276
                INFO     building VAE   sdxl_model_util.py:279
                INFO     loading VAE from checkpoint   sdxl_model_util.py:284
2024-04-17 06:25:28 INFO     VAE:   sdxl_model_util.py:287
                INFO     Enable memory efficient attention for U-Net   train_util.py:2657
import network module: networks.lora
2024-04-17 06:25:31 INFO     create LoRA network. base dim (rank): 64, alpha: 32   lora.py:810
                INFO     neuron dropout: p=0, rank dropout: p=None, module dropout: p=None   lora.py:811
                INFO     create LoRA for Text Encoder 1:   lora.py:902
                INFO     create LoRA for Text Encoder 2:   lora.py:902
2024-04-17 06:25:32 INFO     create LoRA for Text Encoder: 264 modules.   lora.py:910
2024-04-17 06:25:33 INFO     create LoRA for U-Net: 722 modules.   lora.py:918
                INFO     enable LoRA for text encoder   lora.py:961
                INFO     enable LoRA for U-Net   lora.py:966
prepare optimizer, data loader etc.
2024-04-17 06:25:34 INFO     use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False}   train_util.py:4047
                WARNING  because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません   train_util.py:4075
                WARNING  constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません   train_util.py:4079
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 390
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 390
  num epochs / epoch数: 10
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 3900
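For reference, with bucket_no_upscale set, the bucket list above follows each image's own size (scaled down only if it exceeds the 1280x1280 training area, then rounded down to a multiple of bucket_reso_steps = 64) rather than min_bucket_reso/max_bucket_reso. A rough approximation of that calculation (an illustrative sketch, not the sd-scripts implementation):

```python
# Approximate bucket assignment under bucket_no_upscale (illustrative sketch only).
import math

def bucket_reso(width, height, max_area=1280 * 1280, steps=64):
    scale = min(1.0, math.sqrt(max_area / (width * height)))  # shrink only, never upscale
    return (int(width * scale) // steps * steps,
            int(height * scale) // steps * steps)

print(bucket_reso(3000, 2000))  # -> (1536, 1024), one of the buckets listed above
```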

CRCODE22 commented 6 months ago

@bmaltais

CRCODE22 commented 6 months ago

Here is also the content of tmpfilelora.toml:

bucket_no_upscale = true
bucket_reso_steps = 64
caption_dropout_every_n_epochs = 0
caption_dropout_rate = 0
caption_extension = ".txt"
clip_skip = 1
dynamo_backend = "no"
enable_bucket = true
epoch = 10
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
keep_tokens = 0
learning_rate = 0.0003
logging_dir = "K:/AI/Training/Dataset/testwoman_v1/log"
loss_type = "l2"
lr_scheduler = "cosine"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 0
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 150
max_train_steps = 3900
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "bf16"
multires_noise_discount = 0
network_alpha = 32
network_args = []
network_dim = 64
network_dropout = 0
network_module = "networks.lora"
no_half_vae = true
noise_offset_type = "Original"
optimizer_type = "Adafactor"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False",]
output_dir = "K:/AI/Training/Dataset/testwoman_v1/model"
output_name = "testwoman_v1"
pretrained_model_name_or_path = "K:/ArtificialIntelligenceModels/models/Stable-diffusion/sd_xl_base_1.0.safetensors"
prior_loss_weight = 1
resolution = "1280,1280"
sample_every_n_epochs = 1
sample_prompts = "K:/AI/Training/Dataset/testwoman_v1/model\prompt.txt"
sample_sampler = "dpm_2"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
scale_weight_norms = 0
text_encoder_lr = 0.0003
train_batch_size = 1
train_data_dir = "K:/AI/Training/Dataset/testwoman_v1/img"
unet_lr = 0.0003
xformers = true
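To double-check that the GUI actually wrote the intended values, the file can also be inspected directly; a minimal sketch using the third-party toml package (generic Python, not part of kohya_ss):

```python
# Print a few key settings straight from the generated config file.
import toml  # pip install toml

cfg = toml.load("./outputs/tmpfilelora.toml")
for key in ("resolution", "max_train_steps", "mixed_precision",
            "sample_prompts", "sample_sampler", "sample_every_n_epochs"):
    print(f"{key} = {cfg.get(key)}")
```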

bmaltais commented 6 months ago

Everything looks good. Honestly, if the training runs and it does not produce good results, then this is a question for the sd-scripts repo issues page:

https://github.com/kohya-ss/sd-scripts/issues

You should provide the toml alongside the running command in the issue so it can be verified.

Maybe the GUI is not setting a value properly, but the toml file looks great...

What do your VRAM and system memory graphs look like when you run? Are they topping out?
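For a quick spot check from Python, something like this (plain PyTorch calls, not kohya_ss code) reports what the training process itself has allocated on the GPU:

```python
# Report current GPU memory use for device 0.
import torch

print(torch.cuda.get_device_name(0))
print(f"allocated: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 2**30:.2f} GiB")
```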

The GUI uses new versions of sd-scripts as kohya updates them... so how the software runs today is not the same as it did a few days ago... things change quickly.

Chadius commented 6 months ago

Double check that your samples are actually using the sample prompt. When I checked an hour ago, the text content was literally the filename; it didn't copy the prompt from the UI. Also, it thinks I'm using FP16 even though I set it to BF16. I'll probably revert and try again later.

bmaltais commented 6 months ago

Yeah, there was an issue I fixed in the dev branch... that might be why.

Chadius commented 6 months ago

Ooh! Thanks for the quick response (in this forum and on your dev branch!)