CRCODE22 closed this issue 6 months ago.
Here is more information from what kohya_ss printed before training started:
06:24:51-478486 INFO Folder 2_testwoman woman: 2 repeats found
06:24:51-480486 INFO Folder 2_testwoman woman: 195 images found
06:24:51-481485 INFO Folder 2_testwoman woman: 195 * 2 = 390 steps
06:24:51-482487 INFO Regulatization factor: 1
06:24:51-483486 INFO Total steps: 390
06:24:51-485486 INFO Train batch size: 1
06:24:51-486486 INFO Gradient accumulation steps: 1
06:24:51-487486 INFO Epoch: 10
06:24:51-489488 INFO max_train_steps (390 / 1 / 1 * 10 * 1) = 3900
06:24:51-490486 INFO stop_text_encoder_training = 0
06:24:51-491491 INFO lr_warmup_steps = 0
06:24:51-496494 INFO Saving training config to K:/AI/Training/Dataset/test_woman_v1/model\test_woman_v1_20240417-062451.json...
06:24:51-499495 INFO Executing command: "K:\kohya_ss\venv\Scripts\accelerate.EXE" launch --dynamo_backend no --dynamo_mode default --gpu_ids 0 --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 "K:/kohya_ss/sd-scripts/sdxl_train_network.py" --config_file "./outputs/tmpfilelora.toml" with shell=True
06:24:51-507494 INFO Command executed.
A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'
2024-04-17 06:25:10 INFO Loading settings from ./outputs/tmpfilelora.toml... train_util.py:3744
INFO ./outputs/tmpfilelora train_util.py:3763
2024-04-17 06:25:10 INFO prepare tokenizers sdxl_train_util.py:134
2024-04-17 06:25:11 INFO update token length: 150 sdxl_train_util.py:159
INFO Using DreamBooth method. train_network.py:172
2024-04-17 06:25:12 INFO prepare images. train_util.py:1572
INFO found directory K:\AI\Training\Dataset\test_woman_v1\img\2_testwoman woman contains 195 image files train_util.py:1519
INFO 390 train images with repeating. train_util.py:1613
INFO 0 reg images. train_util.py:1616
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1621
INFO [Dataset 0] config_util.py:565
batch_size: 1
resolution: (1280, 1280)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True
[Subset 0 of Dataset 0]
image_dir: "K:\AI\Training\Dataset\test_woman_v1\img\2_testwoman woman"
image_count: 195
num_repeats: 2
shuffle_caption: False
keep_tokens: 0
keep_tokens_separator:
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: testwoman woman
caption_extension: .txt
INFO [Dataset 0] config_util.py:571
INFO loading image sizes. train_util.py:853
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 195/195 [00:00<00:00, 7643.11it/s]
INFO make buckets train_util.py:859
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / train_util.py:876
bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:905
INFO bucket 0: resolution (320, 1280), count: 2 train_util.py:910
INFO bucket 1: resolution (576, 1280), count: 2 train_util.py:910
INFO bucket 2: resolution (640, 832), count: 2 train_util.py:910
INFO bucket 3: resolution (640, 1280), count: 4 train_util.py:910
INFO bucket 4: resolution (704, 1088), count: 2 train_util.py:910
INFO bucket 5: resolution (704, 1280), count: 6 train_util.py:910
INFO bucket 6: resolution (768, 960), count: 2 train_util.py:910
INFO bucket 7: resolution (768, 1280), count: 10 train_util.py:910
INFO bucket 8: resolution (768, 1408), count: 4 train_util.py:910
INFO bucket 9: resolution (768, 1472), count: 4 train_util.py:910
INFO bucket 10: resolution (768, 1600), count: 2 train_util.py:910
INFO bucket 11: resolution (832, 1280), count: 30 train_util.py:910
INFO bucket 12: resolution (832, 1728), count: 4 train_util.py:910
INFO bucket 13: resolution (896, 768), count: 2 train_util.py:910
INFO bucket 14: resolution (896, 1280), count: 4 train_util.py:910
INFO bucket 15: resolution (896, 1408), count: 2 train_util.py:910
INFO bucket 16: resolution (896, 1536), count: 2 train_util.py:910
INFO bucket 17: resolution (896, 1600), count: 2 train_util.py:910
INFO bucket 18: resolution (960, 640), count: 2 train_util.py:910
INFO bucket 19: resolution (960, 1088), count: 2 train_util.py:910
INFO bucket 20: resolution (960, 1280), count: 18 train_util.py:910
INFO bucket 21: resolution (960, 1408), count: 2 train_util.py:910
INFO bucket 22: resolution (960, 1472), count: 6 train_util.py:910
INFO bucket 23: resolution (960, 1536), count: 2 train_util.py:910
INFO bucket 24: resolution (1024, 1280), count: 26 train_util.py:910
INFO bucket 25: resolution (1024, 1344), count: 22 train_util.py:910
INFO bucket 26: resolution (1024, 1408), count: 10 train_util.py:910
INFO bucket 27: resolution (1024, 1472), count: 4 train_util.py:910
INFO bucket 28: resolution (1024, 1536), count: 62 train_util.py:910
INFO bucket 29: resolution (1088, 1280), count: 10 train_util.py:910
INFO bucket 30: resolution (1088, 1344), count: 30 train_util.py:910
INFO bucket 31: resolution (1088, 1408), count: 4 train_util.py:910
INFO bucket 32: resolution (1088, 1472), count: 4 train_util.py:910
INFO bucket 33: resolution (1152, 832), count: 2 train_util.py:910
INFO bucket 34: resolution (1152, 1280), count: 6 train_util.py:910
INFO bucket 35: resolution (1152, 1344), count: 4 train_util.py:910
INFO bucket 36: resolution (1152, 1408), count: 6 train_util.py:910
INFO bucket 37: resolution (1216, 1280), count: 4 train_util.py:910
INFO bucket 38: resolution (1280, 768), count: 2 train_util.py:910
INFO bucket 39: resolution (1280, 832), count: 2 train_util.py:910
INFO bucket 40: resolution (1280, 896), count: 2 train_util.py:910
INFO bucket 41: resolution (1280, 960), count: 2 train_util.py:910
INFO bucket 42: resolution (1280, 1024), count: 2 train_util.py:910
INFO bucket 43: resolution (1280, 1152), count: 6 train_util.py:910
INFO bucket 44: resolution (1280, 1216), count: 2 train_util.py:910
INFO bucket 45: resolution (1280, 1280), count: 18 train_util.py:910
INFO bucket 46: resolution (1344, 1152), count: 4 train_util.py:910
INFO bucket 47: resolution (1408, 704), count: 2 train_util.py:910
INFO bucket 48: resolution (1408, 960), count: 2 train_util.py:910
INFO bucket 49: resolution (1408, 1024), count: 2 train_util.py:910
INFO bucket 50: resolution (1408, 1088), count: 2 train_util.py:910
INFO bucket 51: resolution (1408, 1152), count: 4 train_util.py:910
INFO bucket 52: resolution (1472, 1088), count: 6 train_util.py:910
INFO bucket 53: resolution (1536, 832), count: 8 train_util.py:910
INFO bucket 54: resolution (1536, 1024), count: 4 train_util.py:910
INFO bucket 55: resolution (1664, 960), count: 6 train_util.py:910
INFO bucket 56: resolution (1792, 896), count: 2 train_util.py:910
INFO mean ar error (without repeats): 0.016963672547761124 train_util.py:915
WARNING clip_skip will be unexpected / SDXL学習ではclip_skipは動作しません sdxl_train_util.py:343
INFO preparing accelerator train_network.py:225
accelerator device: cuda
INFO loading model for process 0/1 sdxl_train_util.py:30
INFO load StableDiffusion checkpoint: K:/ArtificialIntelligenceModels/models/Stable-diffusion/sd_xl_base_1.0.safetensors sdxl_train_util.py:70
INFO building U-Net sdxl_model_util.py:192
2024-04-17 06:25:13 INFO loading U-Net from checkpoint sdxl_model_util.py:196
2024-04-17 06:25:22 INFO U-Net:
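As a side note, the max_train_steps value near the top of this log is just the dataset arithmetic the GUI performs; a minimal sketch of that calculation in plain Python, using the numbers from the log above:

```python
# Step arithmetic as logged by the kohya_ss GUI (numbers taken from the log above).
images = 195       # image files found in "2_testwoman woman"
repeats = 2        # from the "2_" folder-name prefix
batch_size = 1     # Train batch size
grad_accum = 1     # Gradient accumulation steps
epochs = 10        # Epoch

steps_per_epoch = (images * repeats) // batch_size // grad_accum   # 390
max_train_steps = steps_per_epoch * epochs                         # 3900
print(steps_per_epoch, max_train_steps)
```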
@bmaltais
Here is also the contents of tmpfilelora.toml:
bucket_no_upscale = true
bucket_reso_steps = 64
caption_dropout_every_n_epochs = 0
caption_dropout_rate = 0
caption_extension = ".txt"
clip_skip = 1
dynamo_backend = "no"
enable_bucket = true
epoch = 10
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
keep_tokens = 0
learning_rate = 0.0003
logging_dir = "K:/AI/Training/Dataset/testwoman_v1/log"
loss_type = "l2"
lr_scheduler = "cosine"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 0
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 150
max_train_steps = 3900
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "bf16"
multires_noise_discount = 0
network_alpha = 32
network_args = []
network_dim = 64
network_dropout = 0
network_module = "networks.lora"
no_half_vae = true
noise_offset_type = "Original"
optimizer_type = "Adafactor"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False",]
output_dir = "K:/AI/Training/Dataset/testwoman_v1/model"
output_name = "testwoman_v1"
pretrained_model_name_or_path = "K:/ArtificialIntelligenceModels/models/Stable-diffusion/sd_xl_base_1.0.safetensors"
prior_loss_weight = 1
resolution = "1280,1280"
sample_every_n_epochs = 1
sample_prompts = "K:/AI/Training/Dataset/testwoman_v1/model\prompt.txt"
sample_sampler = "dpm_2"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
scale_weight_norms = 0
text_encoder_lr = 0.0003
train_batch_size = 1
train_data_dir = "K:/AI/Training/Dataset/testwoman_v1/img"
unet_lr = 0.0003
xformers = true
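For completeness, the values can also be read back directly from the file the GUI wrote, independent of what the UI shows (a minimal sketch; it assumes Python 3.11+ for the standard-library tomllib and is run from the kohya_ss folder so the relative path from the log resolves):

```python
# Read back the generated training config and print the settings most relevant here.
import tomllib  # standard library in Python 3.11+

with open("./outputs/tmpfilelora.toml", "rb") as f:
    cfg = tomllib.load(f)

for key in ("mixed_precision", "save_precision", "sample_prompts",
            "max_train_steps", "train_batch_size", "resolution"):
    print(f"{key} = {cfg.get(key)!r}")
```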
Everything looks good. Honestly, if the training runs and it does not produce good results, then this is a question for the sd-scripts repo issues page:
https://github.com/kohya-ss/sd-scripts/issues
You should provide the toml alongside the command used to run it in the issue so it can be verified.
Maybe the GUI is not setting a value properly, but the toml file looks great...
What do your VRAM and system memory graphs look like when you run? Are they topping out?
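A simple way to watch both from a terminal while a run is going (a minimal sketch; it assumes nvidia-smi is on the PATH and psutil is installed, e.g. inside the kohya_ss venv):

```python
# Poll GPU VRAM (via nvidia-smi) and system RAM (via psutil) every 5 seconds.
import subprocess
import time

import psutil

while True:
    gpu = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip()
    ram = psutil.virtual_memory()
    print(f"VRAM: {gpu} | RAM: {ram.used / 2**30:.1f} / {ram.total / 2**30:.1f} GiB")
    time.sleep(5)
```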
The GUI uses new versions of sd-scripts as kohya updates them... so how the software runs today is not the same as it did a few days ago... things change quickly.
Double-check that your samples are actually using the sample prompt. When I checked an hour ago, the text contents were literally the filename; it didn't copy the prompt from the UI. Also, it thinks I'm using FP16 even though I set it to BF16. I'll probably revert and try again later.
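One way to confirm what the trainer will actually read as the sample prompt is to print the file directly (a minimal sketch; the path is the sample_prompts value from the toml above, adjust it to your own setup):

```python
# Print the sample prompt file exactly as sd-scripts will see it.
from pathlib import Path

prompt_file = Path("K:/AI/Training/Dataset/testwoman_v1/model/prompt.txt")
print("exists:", prompt_file.exists())
if prompt_file.exists():
    print(prompt_file.read_text(encoding="utf-8"))
```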
Yeah, there was an issue I fixed in the dev branch... might be why.
Ooh! Thanks for the quick response (in this forum and on your dev branch!)
epoch 1/10
steps: 10%|████████████████████████▉ | 390/3900 [48:00<7:12:07, 7.39s/it, avr_loss=0.132]
saving checkpoint: K:/AI/Training/Dataset/test_woman-000001.safetensors
2024-04-17 07:13:54 INFO train_util.py:5130
INFO generating sample images at step / サンプル画像生成 ステップ: 390 train_util.py:5131
2024-04-17 07:13:55 INFO prompt: K:/AI/Training/Dataset/test_woman_v1/model train_util.py:5284
INFO negative_prompt: None train_util.py:5285
INFO height: 512 train_util.py:5286
INFO width: 512 train_util.py:5287
INFO sample_steps: 30 train_util.py:5288
INFO scale: 7.5 train_util.py:5289
INFO sample_sampler: dpm_2 train_util.py:5290
K:\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
epoch 2/10
steps: 20%|█████████████████████████████████████████████████▍ | 780/3900 [1:38:12<6:32:48, 7.55s/it, avr_loss=0.135]
saving checkpoint: K:/AI/Training/Dataset/test_woman-000002.safetensors
2024-04-17 08:04:07 INFO train_util.py:5130
INFO generating sample images at step / サンプル画像生成 ステップ: 780 train_util.py:5131
2024-04-17 08:04:08 INFO prompt: K:/AI/Training/Dataset/test_woman_v1/model train_util.py:5284
INFO negative_prompt: None train_util.py:5285
INFO height: 512 train_util.py:5286
INFO width: 512 train_util.py:5287
INFO sample_steps: 30 train_util.py:5288
INFO scale: 7.5 train_util.py:5289
INFO sample_sampler: dpm_2 train_util.py:5290
epoch 3/10
steps: 25%|██████████████████████████████████████████████████████████████ | 980/3900 [2:06:38<6:17:21, 7.75s/it, avr_loss=0.126]
Traceback (most recent call last):
  File "K:\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "K:\kohya_ss\sd-scripts\train_network.py", line 804, in train
    for step, batch in enumerate(train_dataloader):
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\data_loader.py", line 458, in __iter__
    next_batch = next(dataloader_iter)
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 630, in __next__
    data = self._next_data()
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\data\dataset.py", line 302, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "K:\kohya_ss\sd-scripts\library\train_util.py", line 1207, in __getitem__
    img, face_cx, face_cy, face_w, face_h = self.load_image_with_face_info(subset, image_info.absolute_path)
  File "K:\kohya_ss\sd-scripts\library\train_util.py", line 1092, in load_image_with_face_info
    img = load_image(image_path)
  File "K:\kohya_ss\sd-scripts\library\train_util.py", line 2352, in load_image
    img = np.array(image, np.uint8)
  File "K:\kohya_ss\venv\lib\site-packages\PIL\Image.py", line 681, in __array_interface__
    new["data"] = self.tobytes()
  File "K:\kohya_ss\venv\lib\site-packages\PIL\Image.py", line 761, in tobytes
    return b"".join(output)
MemoryError
steps: 25%|██████████████████████████████████████████████████████████████ | 980/3900 [2:06:39<6:17:22, 7.75s/it, avr_loss=0.126]
Traceback (most recent call last):
File "E:\anaconda3\envs\kohya_ss\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "E:\anaconda3\envs\kohya_ss\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "K:\kohya_ss\venv\Scripts\accelerate.exe\main__.py", line 7, in
File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
args.func(args)
File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
simple_launcher(args)
File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['K:\kohya_ss\venv\Scripts\python.exe', 'K:/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', './outputs/tmpfilelora.toml']' returned non-zero exit status 1.
08:32:33-254742 INFO Training has ended.
You can see above that it is generating the samples at 512x512, which is wrong, as I have the following in the sample generation prompt:
masterpiece, best quality, (test woman), solo, wearing black leather pants and a red tshirt, futuristic setting, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy, bad composition, poor, low effort --w 768 --h 1024 --d 1 --l 4.0 --s 40
I hope that kohya_ss can be fixed; it was working very well until I updated it, and with all the changes related to security or something it seems to be completely broken, and I have been unable to train LoRAs for days now. I have been able to train SDXL LoRAs with kohya_ss for many months, so I know what I am doing when it comes to setting up the training settings. 16 GB of VRAM used to be more than enough, and I could even train with a batch size of 4; now even a batch size of 1 uses around 15500 MB of VRAM. I hope @bmaltais can figure out what is responsible for the increased VRAM usage and why kohya_ss appears to be completely broken under Windows 11 Pro, both from the command prompt (cmd) and from the Anaconda Prompt, even though I am using the correct Python version in both.
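Separately, the crash itself is a plain MemoryError raised while PIL converts an image to a numpy array inside load_image(), so it may be worth checking whether a few very large images in the dataset are inflating system-memory use before the next run; a minimal diagnostic sketch (it assumes Pillow is available, e.g. in the kohya_ss venv, and uses the dataset folder from the log above):

```python
# List the ten largest images in the training folder by pixel count.
# Image.open() only reads the header here, so the scan itself stays cheap.
from pathlib import Path
from PIL import Image

folder = Path(r"K:\AI\Training\Dataset\test_woman_v1\img\2_testwoman woman")
sizes = []
for p in folder.iterdir():
    if p.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp", ".bmp"}:
        with Image.open(p) as im:
            sizes.append((im.width * im.height, im.size, p.name))

for pixels, (w, h), name in sorted(sizes, reverse=True)[:10]:
    print(f"{name}: {w}x{h} ({pixels / 1e6:.1f} MP)")
```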