bmaltais / kohya_ss

Apache License 2.0

LoRA training is fixed at 1600 steps in the latest version; it will not do fewer or more steps, and sample generation is also broken. #2303

Closed CRCODE22 closed 6 months ago

CRCODE22 commented 6 months ago

I find it hard to even describe the issue exactly because it is so strange, but I will show what is going on. It comes down to this: training always runs 1600 steps, no fewer and no more, as if the step-calculation formula is broken. Sample generation also does not work anymore:

Training settings (this part is still correct):

15:58:10-964058 INFO log_tracker_config not specified, skipping validation
15:58:10-965058 INFO resume not specified, skipping validation
15:58:10-966059 INFO vae not specified, skipping validation
15:58:10-967094 INFO lora_network_weights not specified, skipping validation
15:58:10-969063 INFO dataset_config not specified, skipping validation
15:58:10-971063 INFO Folder 2_test_woman: 195 images found
15:58:10-973060 INFO Folder 2_test_woman: 390 steps
15:58:10-974060 INFO Total steps: 390
15:58:10-975059 INFO Train batch size: 3
15:58:10-976063 INFO Gradient accumulation steps: 1
15:58:10-977060 INFO Epoch: 12
15:58:10-978059 INFO Regulatization factor: 1
15:58:10-979061 INFO stop_text_encoder_training = 0
15:58:10-980061 INFO lr_warmup_steps = 0
15:58:10-983059 INFO Saving training config to K:/AI/Training/Dataset/test_woman_v1/model\test_woman_v1_20240416-155810.json...

Then when it starts training it goes wrong; it no longer sticks to the above:

running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 390
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 151
  num epochs / epoch数: 11
  batch size per device / バッチサイズ: 3
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1600
steps:   0%|          | 0/1600 [00:00<?, ?it/s]
epoch 1/11
steps:   0%|▍

jehe79 commented 6 months ago

Same issue here. I found that /kohya_ss/outputs/tmpfilelora.toml had the right settings but was not used.

CRCODE22 commented 6 months ago

It sounds to me like it is using fixed values and does not take into account what the user has customized. I have tried multiple different settings that mathematically can never come to 1600 steps, but once you hit Train it goes back to the fixed 1600 steps and does not use the customized values the user provided.

jehe79 commented 6 months ago

I guess you could try and run it manually. I reverted to commit 5bbb4fc (23.0.15) so I can't test dev right now.

Just add your paths to the inputs below:

accelerate launch --num_cpu_threads_per_process=2 "/[path]/kohya_ss/sd-scripts/train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --learning_rate="0.0002" --lr_scheduler="cosine" --lr_scheduler_num_cycles="8" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="768,768" --max_train_steps="8000" --mixed_precision="bf16" --network_alpha="128" --network_dim=128 --network_module=networks.lora --optimizer_type="AdamW8bit" --output_dir="/kohya_projects/object/model" --output_name="object-output-name" --pretrained_model_name_or_path="/path-to-checkpoint.safetensors" --reg_data_dir="/path-to-reg-images/" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="fp16" --text_encoder_lr=0.0001 --train_batch_size="2" --train_data_dir="/path-to-images/img" --unet_lr=0.0002 --xformers --sample_sampler=euler_a --sample_prompts="/kohya_projects/object/model/sample/prompt.txt" --sample_every_n_steps=200

CRCODE22 commented 6 months ago

I guess you could try and run it manually. I reverted to commit 5bbb4fc (23.0.15) so I can't test dev right now.

Just add your paths to the inputs below:

accelerate launch --num_cpu_threads_per_process=2 "/[path]/kohya_ss/sd-scripts/train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --learning_rate="0.0002" --lr_scheduler="cosine" --lr_scheduler_num_cycles="8" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="768,768" --max_train_steps="8000" --mixed_precision="bf16" --network_alpha="128" --network_dim=128 --network_module=networks.lora --optimizer_type="AdamW8bit" --output_dir="/kohya_projects/object/model" --output_name="object-output-name" --pretrained_model_name_or_path="/path-to-checkpoint.safetensors" --reg_data_dir="/path-to-reg-images/" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="fp16" --text_encoder_lr=0.0001 --train_batch_size="2" --train_data_dir="/path-to-images/img" --unet_lr=0.0002 --xformers --sample_sampler=euler_a --sample_prompts="/kohya_projects/object/model/sample/prompt.txt" --sample_every_n_steps=200

Is commit https://github.com/bmaltais/kohya_ss/commit/5bbb4fcf781f870a5cd58557dab87b4e2ef13c6d the older version that still works correctly? Before the updates I was using a kohya_ss version that worked very well, but the updates broke it. It had been a while since I last updated, so I do not know which version of kohya_ss still worked properly, but things have changed and kohya_ss is broken. I wonder how long it will take to fix, but until I can figure out which kohya_ss version still works I am going to use OneTrainer. I need to train LoRAs and I already have a two-day delay now.

CRCODE22 commented 6 months ago

I used git checkout 5bbb4fc and git pull, but that does not work:

(base) E:\kohya_ss>git switch -c 5bbb4fc
Switched to a new branch '5bbb4fc'

(base) E:\kohya_ss>git pull
There is no tracking information for the current branch.
Please specify which branch you want to merge with.
See git-pull(1) for details.
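
A likely explanation for the output above: git switch -c 5bbb4fc creates a new branch named 5bbb4fc at the current HEAD instead of checking out that commit, and the new branch has no upstream, which is why git pull complains. Checking out the commit itself leaves the repo in detached HEAD state and needs no pull afterwards; roughly (assuming the remote is named origin):

git fetch origin
git checkout 5bbb4fc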

CRCODE22 commented 6 months ago

Kohya_ss is broken, even the git workflow. Man, this is frustrating; kohya_ss was great and my favorite until it stopped working.

jehe79 commented 6 months ago

Is commit 5bbb4fc the older version that still works correctly?

Afaik it's the last commit on 23.0.15 which works just fine. I did the checkout, activated my venv and installed requirements.txt and it fired up right away.. no need to pull/fetch.

CRCODE22 commented 6 months ago

Is commit 5bbb4fc the older version that still works correctly?

Afaik it's the last commit on 23.0.15 which works just fine. I did the checkout, activated my venv and installed requirements.txt and it fired up right away.. no need to pull/fetch.

Ok I think I found the release here https://github.com/bmaltais/kohya_ss/releases/tag/v23.0.15

I will try that one thank you.

CRCODE22 commented 6 months ago

@jehe79 that release does not work either:

C:\Users\CRCODE22\AppData\Local\Programs\Python\Python310\python.exe: can't open file 'K:\kohya_ss\sd-scripts\sdxl_train.py': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "C:\Users\CRCODE22\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\CRCODE22\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "K:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

kohya_ss is very broken; hopefully it will get fixed soon so I can use it again.

bmaltais commented 6 months ago

The number of steps is calculated by sd-scripts. If it does not calculate it right, there is not much I can do. You might have to report it directly to kohya on his sd-scripts repo.

bmaltais commented 6 months ago

Oh, the latest dev branch is still giving the can't-find-file error? Let me try to make both Windows and Linux use shell=True… that might be the solution.
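
A minimal sketch of the shell=True idea being described, assuming the GUI starts the trainer through Python's subprocess module (the repo's actual launch code may look different):

import subprocess

# With shell=False the first list element must resolve to an executable by itself;
# with shell=True the whole command string is handed to the platform shell, which
# is more forgiving about quoting and path separators on both Windows and Linux.
cmd = 'accelerate launch "K:/kohya_ss/sd-scripts/sdxl_train_network.py" --config_file ./outputs/tmpfilelora.toml'
subprocess.run(cmd, shell=True, check=True)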

bmaltais commented 6 months ago

I have implemented the fix in dev... hoping it will at least start the training. As for the 1600 steps bug, that is something kohya will need to check in his sd-scripts, as to why it does that...

bmaltais commented 6 months ago

Gosh darn! I think the GUI is the culprit. I modified all the "string" fields to int or float... but my code is still expecting them as str, and therefore all the if conditions are failing. Thanks for raising this.
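
A hypothetical sketch of that class of bug (not the actual kohya_ss GUI code): a check written for string values silently skips its branch once the field arrives as an int, so the flag never reaches sd-scripts and the 1600-step default kicks in:

max_train_steps = 2340                      # used to arrive from the GUI as "2340"
run_cmd = []

if isinstance(max_train_steps, str) and max_train_steps != "":
    # old code path, only ever taken for non-empty strings
    run_cmd += ["--max_train_steps", max_train_steps]

print(run_cmd)  # [] -> the flag is silently dropped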

CRCODE22 commented 6 months ago

Oh, the latest dev branch is still giving the can't-find-file error? Let me try to make both Windows and Linux use shell=True… that might be the solution.

Oh no, the latest dev branch works but has the bug of always using 1600 steps. The error with the missing file was with this release:

https://github.com/bmaltais/kohya_ss/releases/tag/v23.0.15

CRCODE22 commented 6 months ago

Gosh darn! I think the GUI is the culprit. I modified all the "string" fields to int or float... but my code is still expecting them as str, and therefore all the if conditions are failing. Thanks for raising this.

You are welcome, hopefully you can fix it soon :)

bmaltais commented 6 months ago

So much of the code has changed because of that security report... it is hard to keep track. I need to stabilise it before I make any more improvements...

CRCODE22 commented 6 months ago

So much of the code has changed because of that security report... it is hard to keep track. I need to stabilise it before I make any more improvements...

Any ideas on which of your releases I can use that will still have SDXL LoRA training working under Windows 11 Pro?

CRCODE22 commented 6 months ago

I am currently testing the v23.1.3 release, but that does not look promising; there are a lot of errors:

ImportError: accelerate>=0.20.3 is required for a normal functioning of this module, but found accelerate==0.18.0. Try: pip install transformers -U or pip install -e '.[dev]' if you're working with git main

18:34:02-100799 INFO Kohya_ss GUI version: v23.1.3
fatal: not a git repository (or any of the parent directories): .git
18:34:02-552756 ERROR Error during Git operation: Command '['git', 'submodule', 'update', '--init', '--recursive', '--quiet']' returned non-zero exit status 128.

Running on local URL: http://127.0.0.1:7862

To create a public link, set share=True in launch().
18:51:44-758751 INFO Loading config...
K:\kohya_ss-23.1.3\venv\lib\site-packages\gradio\components\dropdown.py:231: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: 150 or set allow_custom_value=True.
  warnings.warn(

CRCODE22 commented 6 months ago

I was using a Kohya version before that worked great. It was a few months old, but it did what I needed it to do. I do not know which release it was, because the earlier git pull updated it to the latest version, and since then Kohya_ss no longer works. If I had known updating would break my installation, I would not have updated. Going back several releases, those do not work either.

bmaltais commented 6 months ago

You can easily go back to a previous release with:

git checkout <release name>

You just need to find the desired release name and use that to go back... best to do that in a freshly cloned repo and then run setup.

bmaltais commented 6 months ago

I have pushed an update to dev... I hope it fixes the 1600 steps issue.

CRCODE22 commented 6 months ago

I have pushed an update to dev... I hope it fixes the 1600 steps issue.

It has fixed that problem, but another problem is occurring now: the same json that worked in the older version even at batch size 4 is now running out of memory even with a batch size of 2, even though there is plenty of VRAM available to Kohya_ss.

Have you made changes that make it use much more VRAM compared to an older kohya_ss version, let's say from 2 months ago?

prepare optimizer, data loader etc.
INFO use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False} train_util.py:4047
WARNING because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません train_util.py:4075
WARNING constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません train_util.py:4079
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 390
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 195
  num epochs / epoch数: 10
  batch size per device / バッチサイズ: 2
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1950
steps:   0%|          | 0/1950 [00:00<?, ?it/s]
epoch 1/10
Traceback (most recent call last):
  File "K:\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "K:\kohya_ss\sd-scripts\train_network.py", line 864, in train
    noise_pred = self.call_unet(
  File "K:\kohya_ss\sd-scripts\sdxl_train_network.py", line 164, in call_unet
    noise_pred = unet(noisy_latents, timesteps, text_embedding, vector_embedding)
  File "K:\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "K:\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\utils\operations.py", line 680, in forward
    return model_forward(*args, **kwargs)
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\utils\operations.py", line 668, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "K:\kohya_ss\venv\lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "K:\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 1107, in forward
    h = call_module(self.middle_block, h, emb, context)
  File "K:\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 1095, in call_module
    x = layer(x, context)
  File "K:\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "K:\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "K:\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 750, in forward
    hidden_states = block(hidden_states, context=encoder_hidden_states, timestep=timestep)
  File "K:\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "K:\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "K:\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 673, in forward
    output = self.forward_body(hidden_states, context, timestep)
  File "K:\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 648, in forward_body
    hidden_states = self.attn1(norm_hidden_states) + hidden_states
  File "K:\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "K:\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "K:\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 445, in forward
    return self.forward_memory_efficient_mem_eff(hidden_states, context, mask)
  File "K:\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 530, in forward_memory_efficient_mem_eff
    out = flash_func.apply(q, k, v, mask, False, q_bucket_size, k_bucket_size)
  File "K:\kohya_ss\venv\lib\site-packages\torch\autograd\function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "K:\kohya_ss\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "K:\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 130, in forward
    exp_weights = torch.exp(attn_weights)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacty of 16.00 GiB of which 0 bytes is free. Of the allocated memory 17.84 GiB is allocated by PyTorch, and 344.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps:   0%|          | 0/1950 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\CRCODE22\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\CRCODE22\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "K:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['K:\kohya_ss\venv\Scripts\python.exe', 'K:/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', './outputs/tmpfilelora.toml', '--max_data_loader_n_workers', '0']' returned non-zero exit status 1.
23:18:29-385588 INFO Training has ended.
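
As a side note, the OOM message itself suggests PYTORCH_CUDA_ALLOC_CONF; a rough sketch of setting it before any CUDA allocation happens (whether it would help in this particular case is untested):

import os
# Must be in the environment before the CUDA caching allocator is initialised,
# e.g. exported in the shell that runs `accelerate launch` or set at the very
# top of the entry script, before torch makes its first CUDA allocation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")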

CRCODE22 commented 6 months ago

@bmaltais It is working now. It appears that loading an older .json file from kohya_ss into the latest version does not bring over all the settings; for example, I had to manually enable gradient checkpointing again and correct several other settings. Thank you for fixing the steps problem :)

I will let you know if sample generation works.

0it [00:00, ?it/s]
2024-04-16 23:23:48 INFO create LoRA network. base dim (rank): 64, alpha: 32 lora.py:810
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora.py:811
INFO create LoRA for Text Encoder 1: lora.py:902
INFO create LoRA for Text Encoder 2: lora.py:902
INFO create LoRA for Text Encoder: 264 modules. lora.py:910
2024-04-16 23:23:49 INFO create LoRA for U-Net: 722 modules. lora.py:918
INFO enable LoRA for text encoder lora.py:961
INFO enable LoRA for U-Net lora.py:966
prepare optimizer, data loader etc.
INFO use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False} train_util.py:4047
WARNING because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません train_util.py:4075
WARNING constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません train_util.py:4079
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 390
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 195
  num epochs / epoch数: 10
  batch size per device / バッチサイズ: 2
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1950
steps:   0%|          | 0/1950 [00:00<?, ?it/s]
epoch 1/10
steps:   0%|▎         | 4/1950 [00:41<5:36:07, 10.36s/it, avr_loss=0.142]

bmaltais commented 6 months ago

I will check the import code… not sure why it did not load the gradient checkpointing value… maybe I will discover another unexpected code issue.

CRCODE22 commented 6 months ago

I might have disabled gradient checkpointing earlier when I was encountering errors, so I cannot be certain it is a code issue on your end, but the Network Rank (Dimension) and Network Alpha values, for example, did not carry over correctly.

CRCODE22 commented 6 months ago

@bmaltais this is very bad; it goes wrong when it finishes 1 epoch.

steps:  10%|████████████ | 195/1950 [31:37<4:44:34, 9.73s/it, avr_loss=0.126]
saving checkpoint: K:/AI/Training/Dataset/testwoman_v1/model\testwoman_v1-000001.safetensors
MemoryError
thread '<unnamed>' panicked at C:\Users\runneradmin\.cargo\registry\src\index.crates.io-6f17d22bba15001f\pyo3-0.20.2\src\err\mod.rs:788:5:
Python API call failed
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Traceback (most recent call last):
  File "K:\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "K:\kohya_ss\sd-scripts\train_network.py", line 970, in train
    save_model(ckpt_name, accelerator.unwrap_model(network), global_step, epoch + 1)
  File "K:\kohya_ss\sd-scripts\train_network.py", line 782, in save_model
    unwrapped_nw.save_weights(ckpt_file, save_dtype, metadata_to_save)
  File "K:\kohya_ss\sd-scripts\networks\lora.py", line 1115, in save_weights
    model_hash, legacy_hash = train_util.precalculate_safetensors_hashes(state_dict, metadata)
  File "K:\kohya_ss\sd-scripts\library\train_util.py", line 2561, in precalculate_safetensors_hashes
    bytes = safetensors.torch.save(tensors, metadata)
  File "K:\kohya_ss\venv\lib\site-packages\safetensors\torch.py", line 245, in save
    serialized = serialize(_flatten(tensors), metadata=metadata)
pyo3_runtime.PanicException: Python API call failed
steps:  10%|████████████ | 195/1950 [31:39<4:44:55, 9.74s/it, avr_loss=0.126]
Traceback (most recent call last):
  File "C:\Users\CRCODE22\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\CRCODE22\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "K:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "K:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['K:\kohya_ss\venv\Scripts\python.exe', 'K:/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', './outputs/tmpfilelora.toml', '--max_data_loader_n_workers', '0']' returned non-zero exit status 1.
23:55:49-098961 INFO Training has ended.

CRCODE22 commented 6 months ago

My computer has 128GB RAM and 16GB VRAM, so there should not be an out-of-memory error.

bmaltais commented 6 months ago

Have you monitored RAM and GPU VRAM usage during the training? Are any showing signs of reaching capacity?

Also, I pushed an update to dev that should address possible parameter issues when they are sent to sd-scripts.
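
On the monitoring question, a rough sketch of logging device-wide VRAM and system RAM from a separate process while training runs, assuming the nvidia-ml-py (pynvml) and psutil packages are installed (GPU-Z or Task Manager give the same information):

import time
import psutil
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

while True:
    # nvmlDeviceGetMemoryInfo reports usage for the whole device, so it also
    # sees the allocations made by the training process.
    mem = pynvml.nvmlDeviceGetMemoryInfo(gpu)
    print(f"VRAM {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB | "
          f"RAM {psutil.virtual_memory().percent:.0f}%")
    time.sleep(5)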

bmaltais commented 6 months ago

Always possible… but 16GB VRAM is not much… I have 24 and can barely train SDXL

CRCODE22 commented 6 months ago

@bmaltais 16GB is plenty to train SDXL at batch size 3 the way I have configured it. I also keep track of VRAM usage with GPU-Z and it does not reach 16GB.

Another problem with the latest dev is that it now uses a fixed value of 0 epochs:

override steps. steps for 0 epochs is / 指定エポックまでのステップ数: 0
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 390
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 195
  num epochs / epoch数: 0
  batch size per device / バッチサイズ: 2
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 0
steps: 0it [00:00, ?it/s]
saving checkpoint: K:/AI/Training/Dataset/test_woman_v1/model\test_woman_sd_xl_base_1280_1280_64_32_v1.safetensors
2024-04-17 05:30:45 INFO model saved. train_network.py:999
steps: 0it [00:03, ?it/s]

These are the settings it should be using:

05:27:00-143872 INFO log_tracker_config not specified, skipping validation
05:27:00-144880 INFO resume not specified, skipping validation
05:27:00-145879 INFO vae not specified, skipping validation
05:27:00-146881 INFO lora_network_weights not specified, skipping validation
05:27:00-148881 INFO dataset_config not specified, skipping validation
05:27:00-150881 INFO Folder 2_test_woman woman: 195 images found
05:27:00-152391 INFO Folder 2_test_woman woman: 390 steps
05:27:00-153908 INFO Total steps: 390
05:27:00-154916 INFO Train batch size: 2
05:27:00-155916 INFO Gradient accumulation steps: 1
05:27:00-157915 INFO Epoch: 12
05:27:00-158915 INFO Regulatization factor: 1
05:27:00-159916 INFO max_train_steps (390 / 2 / 1 * 12 * 1) = 2340
05:27:00-160916 INFO stop_text_encoder_training = 0
05:27:00-163432 INFO lr_warmup_steps = 0
05:27:00-164952 INFO Saving training config to K:/AI/Training/Dataset/test_woman_v1/model\test_woman_v1_20240417-052700.json...

bmaltais commented 6 months ago

Hmm, weird, I will look into why it does that.

bmaltais commented 6 months ago

That should have been fixed in a commit I made last night. Did you do a git pull of the dev branch?

bmaltais commented 6 months ago

I have tested with an old config that had the values as text strings and it worked fine. It must be that you used an older commit of the dev branch that still had the bug. Also, note that sd-scripts sets max_train_steps to 1600 if it is not specified. This is where the 1600 comes from ;-) So if a user does not specify the value, it will be set to 1600 by sd-scripts.
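
For reference, a small sketch of the arithmetic shown in the GUI log earlier in the thread ("max_train_steps (390 / 2 / 1 * 12 * 1) = 2340"), together with the 1600 fallback just described; the exact rounding inside the GUI and sd-scripts may differ:

import math

SD_SCRIPTS_DEFAULT_STEPS = 1600  # fallback used when --max_train_steps is not passed at all

def max_train_steps(images, repeats, batch_size, grad_accum_steps, epochs, reg_factor=1):
    # (images * repeats) / batch_size / grad_accum_steps * epochs * reg_factor
    steps_per_epoch = images * repeats / batch_size / grad_accum_steps
    return math.ceil(steps_per_epoch * epochs * reg_factor)

print(max_train_steps(images=195, repeats=2, batch_size=2, grad_accum_steps=1, epochs=12))
# -> 2340; if the GUI drops the flag, training runs SD_SCRIPTS_DEFAULT_STEPS instead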