Open Hibiki82 opened 4 months ago
Can you do a git pull to get the new version? You appear to be using an older release.
It's up to date.
Microsoft Windows [Version 10.0.19045.4291]
(c) Microsoft Corporation. All rights reserved.
F:\Kohya\kohya_ss>git pull
Already up to date.
Another Run Test:
INFO make buckets train_util.py:859
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is train_util.py:876
set, because bucket reso is defined by image size automatically /
bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
算されるため、min_bucket_resoとmax_bucket_resoは無視されます
INFO number of images (including repeats) / train_util.py:905
各bucketの画像枚数(繰り返し回数を含む)
INFO bucket 0: resolution (384, 768), count: 180 train_util.py:910
INFO mean ar error (without repeats): 0.06233766233766236 train_util.py:915
INFO preparing accelerator train_network.py:225
accelerator device: cpu
INFO loading model for process 0/1 train_util.py:4385
INFO load StableDiffusion checkpoint: train_util.py:4341
F:/Kohya/kohya_ss/models/SD-v1.5-pruned.ckpt
2024-04-18 02:31:50 INFO UNet2DConditionModel: 64, 8, 768, False, False original_unet.py:1387
2024-04-18 02:31:55 INFO loading u-net:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "F:\Kohya\kohya_ss\sd-scripts\train_network.py", line 1115, in
It always ends in: caching latents...
It is complaining that the xformers module is not installed… this is probably the issue. Why is xformers not installing? No idea.
I followed the instructions, but for some reason I couldn't get the same result as them. I'll try to install xFormers manually.
The same problem still occurred.
Uninstalled the existing packages:
pip uninstall torch torchvision xformers -y
Reinstalled PyTorch, torchvision and xFormers:
pip install torch==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install torchvision==0.16.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install xformers==0.0.23.post1+cu118 --index-url https://download.pytorch.org/whl/cu118
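After a reinstall like this, it can help to confirm from inside the venv that the packages are actually importable and that PyTorch sees the GPU (the earlier log showed "accelerator device: cpu"). Below is a minimal sketch, not part of kohya_ss; the function name `check_training_env` is made up for illustration, and it only uses `importlib.util.find_spec`, `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util


def check_training_env():
    """Report whether torch and xformers are installed and whether CUDA is visible.

    Hypothetical helper for illustration; not part of kohya_ss or sd-scripts.
    """
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "xformers_installed": importlib.util.find_spec("xformers") is not None,
        "torch_version": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        # Import torch only when it is present, so the check itself never crashes.
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in check_training_env().items():
        print(f"{key}: {value}")
```

Run it with the venv's python.exe. If `cuda_available` is False, a CPU-only torch build is the likely reason the accelerator falls back to cpu.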
accelerator device: cuda
INFO loading model for process 0/1 train_util.py:4385
INFO load StableDiffusion checkpoint: train_util.py:4341
F:/Kohya/kohya_ss/models/SD-v1.5-pruned.ckpt
2024-04-19 03:02:51 INFO UNet2DConditionModel: 64, 8, 768, False, False original_unet.py:1387
2024-04-19 03:02:56 INFO loading u-net:
Make sure to select no-half-vae... the NaN error for latents is because you need to set no-half-vae to true.
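For anyone wondering why half precision can produce NaN at all, here is a rough, standard-library-only illustration of the mechanism. It does not run the VAE and is not taken from sd-scripts. It only shows that fp16 cannot represent values above 65504, that such values overflow to infinity, and that arithmetic on infinities yields NaN. Whether this is what actually happens in this particular run is not confirmed. The helper names `to_fp16` and `has_nan` are made up for this sketch.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite value representable in half precision


def to_fp16(value):
    """Round a Python float to half precision, overflowing to +/-inf.

    Illustrative helper: struct's "e" format raises OverflowError for values
    that do not fit, so the overflow to infinity is emulated here.
    """
    if math.isnan(value) or math.isinf(value):
        return value
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        return math.copysign(math.inf, value)


def has_nan(values):
    """Return True if any value in the iterable is NaN."""
    return any(math.isnan(v) for v in values)


if __name__ == "__main__":
    big = to_fp16(70000.0)       # too large for fp16, becomes inf
    activations = [to_fp16(1.5), big - big]  # inf - inf is NaN
    print(big, activations, has_nan(activations))
```

Keeping the VAE in full precision avoids this kind of overflow, which is the idea behind a no-half-vae setting.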
I don't see the no-half-vae option in the gui, is it in the python script?
Hmm... true... you are training an SD 1.5 model... not sure why it returns NaN then... you might have to reach out on the sd-scripts issues page and ask there, as this appears to be an sd-scripts issue...
I guess I'll have to create an issue on the sd-scripts issues page. :(
Thanks for troubleshooting my issue.
The good thing is that you will be able to provide the toml to kohya so he can better understand what is triggering the traceback and how he might trap it in his code.
For some reason it is training right now, after I switched from 1.5 to XL. I'm waiting on the result.
Same issue for me, when running with stabilityai/stable-diffusion-xl-base-1.0
INFO caching latents... train_util.py:1021
0%| | 0/18 [00:01<?, ?it/s]
Traceback (most recent call last):
File "F:\projects\LoRa\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
trainer.train(args)
File "F:\projects\LoRa\kohya_ss\sd-scripts\train_network.py", line 273, in train
train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
File "F:\projects\LoRa\kohya_ss\sd-scripts\library\train_util.py", line 2080, in cache_latents
dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
File "F:\projects\LoRa\kohya_ss\sd-scripts\library\train_util.py", line 1023, in cache_latents
cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
File "F:\projects\LoRa\kohya_ss\sd-scripts\library\train_util.py", line 2428, in cache_batch_latents
raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
Tried with stabilityai/stable-diffusion-2-base. It works.
Can you share the json file for the training?
I have the same issue. When I create a training setup from scratch, it works. When I load a config file, I get the error. config.json
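If a fresh setup works and a loaded config fails, comparing the two saved configs is a quick way to narrow down which setting triggers the error. Here is a minimal sketch that assumes both configs are flat JSON objects saved from the GUI. The file names in the usage note and the function names `load_config` and `diff_configs` are hypothetical.

```python
import json


def load_config(path):
    """Load a saved training config (assumed to be a JSON object)."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def diff_configs(working, failing):
    """Return {key: (working_value, failing_value)} for every setting that differs."""
    missing = object()  # sentinel so that a stored null is not confused with "absent"
    changes = {}
    for key in sorted(set(working) | set(failing)):
        a = working.get(key, missing)
        b = failing.get(key, missing)
        if a != b:
            changes[key] = (
                "<missing>" if a is missing else a,
                "<missing>" if b is missing else b,
            )
    return changes


if __name__ == "__main__":
    # Inline stand-ins for two saved configs; the keys are illustrative only.
    working = {"mixed_precision": "bf16", "network_dim": 32}
    failing = {"mixed_precision": "fp16", "network_dim": 1024, "xformers": True}
    for key, (good, bad) in diff_configs(working, failing).items():
        print(f"{key}: working={good!r} failing={bad!r}")
```

With real files, something like `diff_configs(load_config("working.json"), load_config("config.json"))` would list the candidates to flip back one at a time.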
It keeps stopping at this part and never completes the training. What is the issue here? I've reinstalled it many times but still see no results.
import network module: lycoris.kohya
2024-04-17 06:54:29 INFO [Dataset 0] train_util.py:2079
INFO caching latents. train_util.py:974
INFO checking cache validity... train_util.py:984
100%|██████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<?, ?it/s]
INFO caching latents... train_util.py:1021
0%| | 0/18 [00:02<?, ?it/s]
Traceback (most recent call last):
File "F:\Kohya\kohya_ss\sd-scripts\train_network.py", line 1115, in
trainer.train(args)
File "F:\Kohya\kohya_ss\sd-scripts\train_network.py", line 272, in train
train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 2080, in cache_latents
dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 1023, in cache_latents
cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 2428, in cache_batch_latents
raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: F:\Kohya\kohya_ss\Image\houkisei\image\10_houkisei__arlecchino_genshin_impact_drawn_by_houkisei__01f00b735e7d543cee6662af42e74343.jpg
Traceback (most recent call last):
File "C:\Users\Admin\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Admin\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "F:\Kohya\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in
File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
args.func(args)
File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
simple_launcher(args)
File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['F:\Kohya\kohya_ss\venv\Scripts\python.exe', 'F:\Kohya\kohya_ss/sd-scripts/train_network.py', '--bucket_no_upscale', '--bucket_reso_steps=64', '--cache_latents', '--caption_dropout_rate=0.5', '--caption_extension=.txt', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--gradient_checkpointing', '--huber_c=0.1', '--huber_schedule=snr', '--keep_tokens=1', '--learning_rate=1.0', '--loss_type=l2', '--lr_scheduler=constant', '--lr_scheduler_num_cycles=300', '--max_data_loader_n_workers=0', '--max_grad_norm=1', '--resolution=433,770', '--max_train_steps=1800', '--min_snr_gamma=5', '--min_timestep=0', '--mixed_precision=fp16', '--network_alpha=1024', '--network_args', 'preset=full', 'conv_dim=1', 'conv_alpha=1', 'train_on_input=True', 'algo=ia3', '--network_dim=1024', '--network_dropout=0.3', '--network_module=lycoris.kohya', '--noise_offset=0.05', '--adaptive_noise_scale=0.005', '--optimizer_args', 'd_coef=1.0', 'weight_decay=0.01', 'safeguard_warmup=False', 'use_bias_correction=False', '--optimizer_type=Prodigy', '--output_dir=F:/Kohya/kohya_ss/outputs', '--output_name=last', '--pretrained_model_name_or_path=F:/Kohya/kohya_ss/models/SD-v1.5-pruned.ckpt', '--save_every_n_epochs=10', '--save_model_as=safetensors', '--save_precision=fp16', '--scale_weight_norms=1', '--seed=31337', '--shuffle_caption', '--text_encoder_lr=1', '--train_batch_size=1', '--training_comment=rentry.co/ProdiAgy', '--train_data_dir=F:/Kohya/kohya_ss/Image/houkisei/image', '--unet_lr=1', '--xformers']' returned non-zero exit status 1.