Same issue here. I found that /kohya_ss/outputs/tmpfilelora.toml had the right settings but was not used.
It sounds to me like it is using fixed values and does not take into account what the user has customized. I have tried multiple different settings that mathematically can never add up to 1600 steps, but once you hit training it goes back to the fixed 1600 steps and ignores the values the user provided.
I guess you could try and run it manually. I reverted to commit 5bbb4fc (23.0.15) so I can't test dev right now.
Just add your paths to the inputs below:
accelerate launch --num_cpu_threads_per_process=2 "/[path]/kohya_ss/sd-scripts/train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --learning_rate="0.0002" --lr_scheduler="cosine" --lr_scheduler_num_cycles="8" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="768,768" --max_train_steps="8000" --mixed_precision="bf16" --network_alpha="128" --network_dim=128 --network_module=networks.lora --optimizer_type="AdamW8bit" --output_dir="/kohya_projects/object/model" --output_name="object-output-name" --pretrained_model_name_or_path="/path-to-checkpoint.safetensors" --reg_data_dir="/path-to-reg-images/" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="fp16" --text_encoder_lr=0.0001 --train_batch_size="2" --train_data_dir="/path-to-images/img" --unet_lr=0.0002 --xformers --sample_sampler=euler_a --sample_prompts="/kohya_projects/object/model/sample/prompt.txt" --sample_every_n_steps=200
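Run it from inside the activated venv (.\venv\Scripts\activate on Windows, source venv/bin/activate on Linux) so accelerate and the sd-scripts dependencies resolve, and replace the [path] and /path-to-... placeholders with your own locations.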
Is commit https://github.com/bmaltais/kohya_ss/commit/5bbb4fcf781f870a5cd58557dab87b4e2ef13c6d the older version that still works correctly? Before the updates I was using a kohya_ss version that worked very well, but updating broke it. It had been a while since I updated, so I do not know which version of kohya_ss still worked properly, but things have changed and kohya_ss is broken now. I wonder how long it will take to fix. Until I can figure out which kohya_ss version still works I am going to use OneTrainer; I need to train LoRAs and I am already two days behind.
I used git checkout 5bbb4fc followed by git pull, but that does not work:
(base) E:\kohya_ss>git switch -c 5bbb4fc
Switched to a new branch '5bbb4fc'

(base) E:\kohya_ss>git pull
There is no tracking information for the current branch.
Please specify which branch you want to merge with.
See git-pull(1) for details.
Kohya_ss is broken, and even going back through git is not working for me. This is frustrating; kohya_ss was great and my favorite until it stopped working.
Is commit 5bbb4fc the older version that still works correctly?
AFAIK it's the last commit on 23.0.15, which works just fine. I did the checkout, activated my venv, installed requirements.txt, and it fired up right away. No need to pull/fetch.
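Roughly what I did, from memory (just a sketch; it assumes the repo is already cloned and the venv already exists, and the activation line is the Windows one):

cd kohya_ss
git checkout 5bbb4fc
.\venv\Scripts\activate
pip install -r requirements.txt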
OK, I think I found the release here: https://github.com/bmaltais/kohya_ss/releases/tag/v23.0.15
I will try that one, thank you.
@jehe79 that release does not work either:
C:\Users\CRCODE22\AppData\Local\Programs\Python\Python310\python.exe: can't open file 'K:\kohya_ss\sd-scripts\sdxl_train.py': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "C:\Users\CRCODE22\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\CRCODE22\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "K:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
kohya_ss is very broken; hopefully it will get fixed soon so I can use it again.
The number of steps is calculated by sd-scripts. If it does not calculate it right there is not much I can do. You might have to report it directly to kohya on his sd-scripts repo.
Oh, the latest dev branch is still giving the can't-find-file error? Let me try to make both Windows and Linux use shell=True… that might be the solution.
I have implemented the fix in dev... hoping it will at least start the training. The 1600 steps bug is something kohya will need to check in his sd-scripts, as to why it does that...
Gosh darn! I think the GUI is the culprit. I modified all the "string" fields to int or float... but my code is still expecting them as str and therefore all the if conditions are failing. Thanks for raising this.
Oh no, the latest dev branch works, but it has the bug of always using 1600 steps. The error with the missing file was from this commit:
You are welcome, hopefully you can fix it soon :)
So much of the code has changed because of that security report... it is hard to keep track. I need to stabilise it before I make any more improvements...
Any ideas on which of your releases I can use that still has SDXL LoRA training working under Windows 11 Pro?
I am currently testing the v23.1.3 release, but that does not look promising, there are a lot of errors:
ImportError: accelerate>=0.20.3 is required for a normal functioning of this module, but found accelerate==0.18.0. Try: pip install transformers -U or pip install -e '.[dev]' if you're working with git main
18:34:02-100799 INFO Kohya_ss GUI version: v23.1.3
fatal: not a git repository (or any of the parent directories): .git
18:34:02-552756 ERROR Error during Git operation: Command '['git', 'submodule', 'update', '--init', '--recursive', '--quiet']' returned non-zero exit status 128.
Running on local URL: http://127.0.0.1:7862
To create a public link, set share=True in launch().
18:51:44-758751 INFO Loading config...
K:\kohya_ss-23.1.3\venv\lib\site-packages\gradio\components\dropdown.py:231: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: 150 or set allow_custom_value=True.
warnings.warn(
I was using a Kohya version before that worked great; it was a few months old but it did what I needed it to do. I do not know which release it was, because the earlier git pull updated it to the latest version, and since then kohya_ss no longer works. If I had known updating would break my installation, I would not have updated. Going several releases back, those do not work either.
You can easily go back to a previous release with:
git checkout <release name>
You just need to find the desired release name and use that to go back... best to do that in a freshly cloned repo and then run setup.
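For example, something like this (just a sketch; git tag --list prints the available release names, v23.0.15 is simply the release mentioned above, and you would run the setup script for your OS afterwards):

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
git tag --list
git checkout v23.0.15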
I have pushed an update to dev... I hope it fixes the 1600 steps issue.
It has fixed that problem, but another problem is occurring now: the same json that worked in the older version even at batch size 4 is now running out of memory even with a batch size of 2, even though there is plenty of VRAM available to kohya_ss.
Have you made changes that make it use much more VRAM compared to an older kohya_ss version, let's say from 2 months ago?
prepare optimizer, data loader etc.
INFO use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False} train_util.py:4047
WARNING because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / train_util.py:4075
max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません
WARNING constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません train_util.py:4079
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 390
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 195
num epochs / epoch数: 10
batch size per device / バッチサイズ: 2
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1950
steps: 0%| | 0/1950 [00:00<?, ?it/s]
epoch 1/10
Traceback (most recent call last):
  File "K:\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
@bmaltais It is working now. It appears that loading an older .json file from kohya_ss in the latest version does not bring over all the settings; for example, I had to manually enable gradient checkpointing again and correct several other settings. It is working now, thank you for fixing the steps problem :)
I will let you know if sample generation works.
0it [00:00, ?it/s]
2024-04-16 23:23:48 INFO create LoRA network. base dim (rank): 64, alpha: 32 lora.py:810
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora.py:811
INFO create LoRA for Text Encoder 1: lora.py:902
INFO create LoRA for Text Encoder 2: lora.py:902
INFO create LoRA for Text Encoder: 264 modules. lora.py:910
2024-04-16 23:23:49 INFO create LoRA for U-Net: 722 modules. lora.py:918
INFO enable LoRA for text encoder lora.py:961
INFO enable LoRA for U-Net lora.py:966
prepare optimizer, data loader etc.
INFO use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False} train_util.py:4047
WARNING because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / train_util.py:4075
max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません
WARNING constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません train_util.py:4079
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 390
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 195
num epochs / epoch数: 10
batch size per device / バッチサイズ: 2
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1950
steps: 0%| | 0/1950 [00:00<?, ?it/s]
epoch 1/10
steps: 0%|▎ | 4/1950 [00:41<5:36:07, 10.36s/it, avr_loss=0.142]
I will check the import code… not sure why it did not load the gradient checkpointing value… maybe I will discover another unexpected code issue.
I might have disabled the gradient checkpointing earlier when I was encountering errors, so I cannot be certain it is a code issue on your end, but the Network Rank (Dimension) and Network Alpha values, for example, did not carry over correctly.
@bmaltais this is very bad, it goes wrong when it finishes 1 epoch.
steps: 10%|████████████ | 195/1950 [31:37<4:44:34, 9.73s/it, avr_loss=0.126]
saving checkpoint: K:/AI/Training/Dataset/testwoman_v1/model\testwoman_v1-000001.safetensors
MemoryError
thread '<unnamed>' panicked
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "K:\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
My computer has 128GB RAM and 16GB VRAM so there should not be an out of memory error.
Have you monitored RAM and GPU VRAM usage during the training? Are any showing signs of reaching capacity?
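If the NVIDIA tools are on your PATH, something like this in a second terminal gives a rough live readout while training (standard nvidia-smi query options, as far as I know):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1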
Also, I pushed an update to dev that should address possible parameter issues when they are sent to sd-scripts.
Always possible… but 16GB VRAM is not much… I have 24GB and can barely train SDXL.
@bmaltais 16GB is plenty to train SDXL at batch size 3 the way I have it configured. I also keep track of VRAM usage with GPU-Z; it does not reach 16GB VRAM.
Another problem with the latest dev now is that it ends up with a fixed value of 0 epochs:
override steps. steps for 0 epochs is / 指定エポックまでのステップ数: 0
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 390
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 195
num epochs / epoch数: 0
batch size per device / バッチサイズ: 2
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 0
steps: 0it [00:00, ?it/s]
saving checkpoint: K:/AI/Training/Dataset/test_woman_v1/model\test_woman_sd_xl_base_1280_1280_64_32_v1.safetensors
2024-04-17 05:30:45 INFO model saved. train_network.py:999
steps: 0it [00:03, ?it/s]
These are the settings it should be using:
05:27:00-143872 INFO log_tracker_config not specified, skipping validation
05:27:00-144880 INFO resume not specified, skipping validation
05:27:00-145879 INFO vae not specified, skipping validation
05:27:00-146881 INFO lora_network_weights not specified, skipping validation
05:27:00-148881 INFO dataset_config not specified, skipping validation
05:27:00-150881 INFO Folder 2_test_woman woman: 195 images found
05:27:00-152391 INFO Folder 2_test_woman woman: 390 steps
05:27:00-153908 INFO Total steps: 390
05:27:00-154916 INFO Train batch size: 2
05:27:00-155916 INFO Gradient accumulation steps: 1
05:27:00-157915 INFO Epoch: 12
05:27:00-158915 INFO Regulatization factor: 1
05:27:00-159916 INFO max_train_steps (390 / 2 / 1 * 12 * 1) = 2340
05:27:00-160916 INFO stop_text_encoder_training = 0
05:27:00-163432 INFO lr_warmup_steps = 0
05:27:00-164952 INFO Saving training config to K:/AI/Training/Dataset/test_woman_v1/model\test_woman_v1_20240417-052700.json...
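In other words, the GUI itself computes 390 / 2 / 1 * 12 * 1 = 2340 steps here (total image steps / batch size / gradient accumulation * epochs * regularization factor), yet the run above reports 0 epochs and 0 total optimization steps, so the computed value apparently never reaches sd-scripts.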
Hmm, weird, I will search for why it does that.
That should have been fixed in a commit I made last night. Did you do a git pull of the dev branch?
I have tested with an old config that had the values as text strings and it worked fine. It must be that you used an older commit of the dev branch that had the bug. Also, note that sd-scripts sets max_train_steps to 1600 if it is not specified. This is where the 1600 comes from ;-) So if a user does not specify the value, it will be set to 1600 by sd-scripts.
I find it hard to even know how to describe the issue exactly because it is so strange, but I will show what is going on. It comes down to this: it will always train for 1600 steps, not less and not more, as if the formula system is broken. Sample generation also does not work anymore.
Training settings (this part is still correct):
15:58:10-964058 INFO log_tracker_config not specified, skipping validation
15:58:10-965058 INFO resume not specified, skipping validation
15:58:10-966059 INFO vae not specified, skipping validation
15:58:10-967094 INFO lora_network_weights not specified, skipping validation
15:58:10-969063 INFO dataset_config not specified, skipping validation
15:58:10-971063 INFO Folder 2_test_woman: 195 images found
15:58:10-973060 INFO Folder 2_test_woman: 390 steps
15:58:10-974060 INFO Total steps: 390
15:58:10-975059 INFO Train batch size: 3
15:58:10-976063 INFO Gradient accumulation steps: 1
15:58:10-977060 INFO Epoch: 12
15:58:10-978059 INFO Regulatization factor: 1
15:58:10-979061 INFO stop_text_encoder_training = 0
15:58:10-980061 INFO lr_warmup_steps = 0
15:58:10-983059 INFO Saving training config to K:/AI/Training/Dataset/test_woman_v1/model\test_woman_v1_20240416-155810.json...
Then when it starts training it goes wrong; it no longer sticks to the above:
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 390
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 151
num epochs / epoch数: 11
batch size per device / バッチサイズ: 3
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1600
steps: 0%| | 0/1600 [00:00<?, ?it/s]
epoch 1/11
steps: 0%|▍
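For comparison, with these settings the formula should give 390 / 3 / 1 * 12 * 1 = 1560 steps, and the settings log above never prints a max_train_steps line at all, so presumably nothing is being passed and sd-scripts falls back to its built-in 1600 default.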