ERROR Trying to train LORA Model on my PC, HELP!!

Hibiki82 commented 4 months ago

It keeps stopping at this part and never completes the training. What is the issue here? I've reinstalled it many times but still see no results.

import network module: lycoris.kohya
2024-04-17 06:54:29 INFO [Dataset 0] train_util.py:2079
INFO caching latents. train_util.py:974
INFO checking cache validity... train_util.py:984
100%|██████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<?, ?it/s]
INFO caching latents... train_util.py:1021
0%| | 0/18 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "F:\Kohya\kohya_ss\sd-scripts\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "F:\Kohya\kohya_ss\sd-scripts\train_network.py", line 272, in train
    train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
  File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 2080, in cache_latents
    dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
  File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 1023, in cache_latents
    cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
  File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 2428, in cache_batch_latents
    raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: F:\Kohya\kohya_ss\Image\houkisei\image\10_houkisei__arlecchino_genshin_impact_drawn_by_houkisei__01f00b735e7d543cee6662af42e74343.jpg
Traceback (most recent call last):
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "F:\Kohya\kohya_ss\venv\Scripts\accelerate.exe__main__.py", line 7, in <module>
  File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['F:\Kohya\kohya_ss\venv\Scripts\python.exe', 'F:\Kohya\kohya_ss/sd-scripts/train_network.py', '--bucket_no_upscale', '--bucket_reso_steps=64', '--cache_latents', '--caption_dropout_rate=0.5', '--caption_extension=.txt', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--gradient_checkpointing', '--huber_c=0.1', '--huber_schedule=snr', '--keep_tokens=1', '--learning_rate=1.0', '--loss_type=l2', '--lr_scheduler=constant', '--lr_scheduler_num_cycles=300', '--max_data_loader_n_workers=0', '--max_grad_norm=1', '--resolution=433,770', '--max_train_steps=1800', '--min_snr_gamma=5', '--min_timestep=0', '--mixed_precision=fp16', '--network_alpha=1024', '--network_args', 'preset=full', 'conv_dim=1', 'conv_alpha=1', 'train_on_input=True', 'algo=ia3', '--network_dim=1024', '--network_dropout=0.3', '--network_module=lycoris.kohya', '--noise_offset=0.05', '--adaptive_noise_scale=0.005', '--optimizer_args', 'd_coef=1.0', 'weight_decay=0.01', 'safeguard_warmup=False', 'use_bias_correction=False', '--optimizer_type=Prodigy', '--output_dir=F:/Kohya/kohya_ss/outputs', '--output_name=last', '--pretrained_model_name_or_path=F:/Kohya/kohya_ss/models/SD-v1.5-pruned.ckpt', '--save_every_n_epochs=10', '--save_model_as=safetensors', '--save_precision=fp16', '--scale_weight_norms=1', '--seed=31337', '--shuffle_caption', '--text_encoder_lr=1', '--train_batch_size=1', '--training_comment=rentry.co/ProdiAgy', '--train_data_dir=F:/Kohya/kohya_ss/Image/houkisei/image', '--unet_lr=1', '--xformers']' returned non-zero exit status 1.

bmaltais commented 4 months ago

Can you do a git pull to get the new version? You appear to be using an older release.

Hibiki82 commented 4 months ago

It's up to date.

F:\Kohya\kohya_ss>git pull
Already up to date.

Another test run:

                INFO     make buckets                                                              train_util.py:859
                WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is   train_util.py:876
                         set, because bucket reso is defined by image size automatically /
                         bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
                         算されるため、min_bucket_resoとmax_bucket_resoは無視されます
                INFO     number of images (including repeats) /                                    train_util.py:905
                         各bucketの画像枚数（繰り返し回数を含む）
                INFO     bucket 0: resolution (384, 768), count: 180                               train_util.py:910
                INFO     mean ar error (without repeats): 0.06233766233766236                      train_util.py:915
                INFO     preparing accelerator                                                  train_network.py:225
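The bucket line in the log above can be reproduced by hand. This is only a sketch under two assumptions that the thread does not confirm: that the training images are 433x770 (taken from `--resolution=433,770`), and that with `--bucket_no_upscale` each side is rounded down to a multiple of `--bucket_reso_steps=64`. The helper `round_down_to_step` is a hypothetical name, not a function from sd-scripts.

```python
def round_down_to_step(value: int, step: int) -> int:
    """Round value down to the nearest multiple of step."""
    return (value // step) * step

# Values taken from the command line in this thread (assumed to match the image size).
width, height, step = 433, 770, 64

# Assumed bucketing rule: round each side down to a multiple of the step.
bucket = (round_down_to_step(width, step), round_down_to_step(height, step))

# Aspect-ratio error: difference between the image ratio and the bucket ratio.
ar_error = abs(width / height - bucket[0] / bucket[1])

print(bucket)    # (384, 768)
print(ar_error)  # roughly 0.0623
```

Under those assumptions the result matches the logged bucket (384, 768) and the logged mean ar error of about 0.0623, which suggests the bucketing step itself behaved as expected in this run.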

accelerator device: cpu
INFO loading model for process 0/1 train_util.py:4385
INFO load StableDiffusion checkpoint: train_util.py:4341
     F:/Kohya/kohya_ss/models/SD-v1.5-pruned.ckpt
2024-04-18 02:31:50 INFO UNet2DConditionModel: 64, 8, 768, False, False original_unet.py:1387
2024-04-18 02:31:55 INFO loading u-net: model_util.py:1009
2024-04-18 02:31:56 INFO loading vae: model_util.py:1017
2024-04-18 02:31:57 INFO loading text encoder: model_util.py:1074
2024-04-18 02:31:58 INFO Enable xformers for U-Net train_util.py:2660
Traceback (most recent call last):
  File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 2662, in replace_unet_modules
    import xformers.ops
ModuleNotFoundError: No module named 'xformers'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "F:\Kohya\kohya_ss\sd-scripts\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "F:\Kohya\kohya_ss\sd-scripts\train_network.py", line 240, in train
    train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers, args.sdpa)
  File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 2664, in replace_unet_modules
    raise ImportError("No xformers / xformersがインストールされていないようです")
ImportError: No xformers / xformersがインストールされていないようです
Traceback (most recent call last):
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "F:\Kohya\kohya_ss\venv\Scripts\accelerate.exe__main__.py", line 7, in <module>
  File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['F:\Kohya\kohya_ss\venv\Scripts\python.exe', 'F:/Kohya/kohya_ss/sd-scripts/train_network.py', '--config_file', './outputs/tmpfilelora.toml']' returned non-zero exit status 1.
02:32:01-537132 INFO Training has ended.

Hibiki82 commented 4 months ago

It always ends in: caching latents...
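The run stops during latent caching because a RuntimeError is raised for one specific image. As a rough aid for anyone hitting the same thing, the sketch below pulls the reported image paths out of a saved console log so the files can be inspected or temporarily removed from the dataset. It assumes the log has one record per line (not flattened as pasted here); `find_nan_images` and `sample_log` are hypothetical names for illustration.

```python
import re

# Pattern for the final error line as it appears in the traceback.
NAN_LINE = re.compile(r"^RuntimeError: NaN detected in latents: (.+)$", re.MULTILINE)

def find_nan_images(log_text: str) -> list:
    """Return each distinct image path reported by a 'NaN detected in latents' error."""
    paths = []
    for match in NAN_LINE.finditer(log_text):
        path = match.group(1).strip()  # strip a trailing '\r' from Windows logs
        if path not in paths:
            paths.append(path)
    return paths

# Sample built from the traceback earlier in this thread.
sample_log = "\n".join([
    '    raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")',
    r"RuntimeError: NaN detected in latents: F:\Kohya\kohya_ss\Image\houkisei\image\10_houkisei__arlecchino_genshin_impact_drawn_by_houkisei__01f00b735e7d543cee6662af42e74343.jpg",
    "Traceback (most recent call last):",
])

for path in find_nan_images(sample_log):
    print(path)
```

If the same single file is reported on every run, checking that one image first is cheaper than reinstalling; if a different file is reported after removing it, the cause is more likely the model or precision settings than the data.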

bmaltais commented 4 months ago

It is complaining of the xformers module not being installed… this is probably the issue. Why is xformers not installing? No idea.
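A quick way to confirm the missing module, sketched here with the standard library only: run this with the Python interpreter from the kohya_ss venv (the one at `venv\Scripts\python.exe` in the logs), since a package installed in another environment will not be visible to the trainer. `module_available` is a hypothetical helper name.

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None

# Packages the training run in this thread depends on.
for name in ("torch", "torchvision", "xformers"):
    status = "found" if module_available(name) else "MISSING"
    print(f"{name}: {status}")
```

If `xformers` shows as missing, the `--xformers` flag will fail at the import shown in the traceback, independently of any other problem.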

Hibiki82 commented 4 months ago

It is complaining of the xformers module not being installed… this is probably the issue. Why is xformers not installing? No idea.

I followed the instructions, but for some reason I couldn't get the same result. I'll try to install xFormers manually.

Hibiki82 commented 4 months ago

The same problem still occurred.

Uninstalled the existing packages:
pip uninstall torch torchvision xformers -y

Reinstalled PyTorch, torchvision, and xFormers:
pip install torch==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install torchvision==0.16.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install xformers==0.0.23.post1+cu118 --index-url https://download.pytorch.org/whl/cu118

accelerator device: cuda
INFO loading model for process 0/1 train_util.py:4385
INFO load StableDiffusion checkpoint: train_util.py:4341
     F:/Kohya/kohya_ss/models/SD-v1.5-pruned.ckpt
2024-04-19 03:02:51 INFO UNet2DConditionModel: 64, 8, 768, False, False original_unet.py:1387
2024-04-19 03:02:56 INFO loading u-net: model_util.py:1009
INFO loading vae: model_util.py:1017
2024-04-19 03:02:58 INFO loading text encoder: model_util.py:1074
INFO Enable xformers for U-Net train_util.py:2660
import network module: networks.lora
2024-04-19 03:02:59 INFO [Dataset 0] train_util.py:2079
INFO caching latents. train_util.py:974
INFO checking cache validity... train_util.py:984
100%|██████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<?, ?it/s]
INFO caching latents... train_util.py:1021
0%| | 0/18 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "F:\Kohya\kohya_ss\sd-scripts\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "F:\Kohya\kohya_ss\sd-scripts\train_network.py", line 272, in train
    train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
  File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 2080, in cache_latents
    dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
  File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 1023, in cache_latents
    cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
  File "F:\Kohya\kohya_ss\sd-scripts\library\train_util.py", line 2428, in cache_batch_latents
    raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: F:\Kohya\training_images\houkisei\image\10_houkisei__arlecchino_genshin_impact_drawn_by_houkisei__01f00b735e7d543cee6662af42e74343.jpg
Traceback (most recent call last):
  File "C:\Users\Sky\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Sky\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "F:\Kohya\kohya_ss\venv\Scripts\accelerate.exe__main__.py", line 7, in <module>
  File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "F:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['F:\Kohya\kohya_ss\venv\Scripts\python.exe', 'F:/Kohya/kohya_ss/sd-scripts/train_network.py', '--config_file', './outputs/tmpfilelora.toml']' returned non-zero exit status 1.
03:03:04-779926 INFO Training has ended.
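This run reports `accelerator device: cuda`, where the earlier run reported `cpu`, so the reinstall changed which PyTorch build is in use. A small sketch for checking that directly, again meant to be run with the venv's Python; `describe_torch` is a hypothetical helper, and it only reports what is installed rather than diagnosing the NaN error.

```python
def describe_torch() -> str:
    """Summarize the installed PyTorch build, or report that it is missing."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    return (
        f"torch {torch.__version__}, "
        f"CUDA build: {torch.version.cuda}, "          # None on CPU-only builds
        f"CUDA available: {torch.cuda.is_available()}"
    )

print(describe_torch())
```

A CPU-only build would show `CUDA build: None`, which would be consistent with the earlier `accelerator device: cpu` line.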

bmaltais commented 4 months ago

Make sure to select no-half-vae... the NaN error for latents is because you need to set no-half-vae to true.
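For background on why half precision is suspected: fp16 has a small finite range, and values that overflow it become infinity, after which ordinary arithmetic can yield NaN. The sketch below is a generic illustration of that mechanism using only the standard library; it is not a diagnosis of this particular failure, and `fits_in_fp16` is a hypothetical helper.

```python
import math
import struct

def fits_in_fp16(value: float) -> bool:
    """Return True if value can be stored as an IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except (OverflowError, struct.error):
        return False

print(fits_in_fp16(65504.0))  # largest finite fp16 value
print(fits_in_fp16(70000.0))  # beyond the fp16 range

# Once a value has overflowed to infinity, arithmetic on it can produce NaN.
overflowed = math.inf
print(math.isnan(overflowed - overflowed))
```

Running the VAE in full precision avoids this kind of overflow at the cost of more memory, which is the idea behind an option such as no-half-vae.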

Hibiki82 commented 4 months ago

I don't see the no-half-vae option in the GUI. Is it in the Python script?

bmaltais commented 4 months ago

Hmm... true... you are training an SD 1.5 model... not sure why it returns NaN then... You might have to reach out on the sd-scripts issues page and ask there, as this appears to be an sd-scripts issue...

Hibiki82 commented 4 months ago

I guess I'll have to create an issue on the sd-scripts issues page. :(

Thanks for troubleshooting my issue.

bmaltais commented 4 months ago

The good thing is that you will be able to provide the toml to kohya so he can better understand what is triggering the traceback and how he might trap it in his code.

Hibiki82 commented 4 months ago

For some reason it is training right now, after I switched from 1.5 to XL. I'm waiting on the result.

nilupulmanodya commented 4 months ago

Same issue for me, when running with stabilityai/stable-diffusion-xl-base-1.0

INFO caching latents... train_util.py:1021
0%| | 0/18 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "F:\projects\LoRa\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "F:\projects\LoRa\kohya_ss\sd-scripts\train_network.py", line 273, in train
    train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
  File "F:\projects\LoRa\kohya_ss\sd-scripts\library\train_util.py", line 2080, in cache_latents
    dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
  File "F:\projects\LoRa\kohya_ss\sd-scripts\library\train_util.py", line 1023, in cache_latents
    cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
  File "F:\projects\LoRa\kohya_ss\sd-scripts\library\train_util.py", line 2428, in cache_batch_latents
    raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")

nilupulmanodya commented 4 months ago

Tried with stabilityai/stable-diffusion-2-base. It works.

bmaltais commented 4 months ago

Can you share the json file for the training?

fraprion commented 4 months ago

I have the same issue. When I create a training from scratch it works. When I load a config file I get the error. config.json
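Since a from-scratch setup works and a loaded config fails, comparing the two saved configs may narrow down which setting triggers the error. A minimal sketch, assuming both configs are flat JSON objects; the key names and values below are hypothetical placeholders loosely modeled on the command-line flags in this thread, not the real contents of config.json.

```python
import json

def diff_configs(working: dict, failing: dict) -> dict:
    """Return {key: (working_value, failing_value)} for every setting that differs."""
    absent = "<absent>"
    changed = {}
    for key in sorted(set(working) | set(failing)):
        a = working.get(key, absent)
        b = failing.get(key, absent)
        if a != b:
            changed[key] = (a, b)
    return changed

# Hypothetical excerpts; in practice load both files with json.load().
working = json.loads('{"mixed_precision": "fp16", "network_dim": 8, "learning_rate": 0.0001}')
failing = json.loads('{"mixed_precision": "fp16", "network_dim": 1024, "learning_rate": 1.0}')

for key, (good, bad) in diff_configs(working, failing).items():
    print(f"{key}: working={good!r} failing={bad!r}")
```

Changing the differing settings one at a time, starting from the working config, would show which one reintroduces the NaN error.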

bmaltais / kohya_ss

ERROR Trying to train LORA Model on my PC, HELP!! #2311