bmaltais / kohya_ss

Apache License 2.0

SDXL Training Error - RuntimeError: NaN detected in latents #2674

Open PENGUINADELIE opened 3 months ago

PENGUINADELIE commented 3 months ago

I am trying to create an SDXL LoRA using Runpod. My dataset consists of 25 images of women, each with a size of 1024x1024 pixels. I keep encountering error logs indicating issues with the images. All images are 1024x1024 pixels, and I've tried using both PNG and JPG formats, but the issue persists. Does anyone know how to fix this?

[Error log]
  File "/workspace/kohya_ss/sd-scripts/sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "/workspace/kohya_ss/sd-scripts/train_network.py", line 272, in train
    train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
  File "/workspace/kohya_ss/sd-scripts/library/train_util.py", line 2324, in cache_latents
    dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process, file_suffix)
  File "/workspace/kohya_ss/sd-scripts/library/train_util.py", line 1146, in cache_latents
    cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.alpha_mask, subset.random_crop)
  File "/workspace/kohya_ss/sd-scripts/library/train_util.py", line 2772, in cache_batch_latents
    raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: /workspace/data/img_xyzminji/40xyzminji woman/xyzminji(1).jpg
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/workspace/kohya_ss/venv/bin/python', '/workspace/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', '/workspace/data/model_lora/config_lora-20240729-035252.toml']' returned non-zero exit status 1.
03:53:26-826393 INFO Training has ended.

5KilosOfCheese commented 3 months ago

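Could you post your settings (the config file) so we can see if something is wrong there?

Also, things you should check:

  • Make sure there are no images with the same file name, i.e. check that Image.jpg and Image.png are not both present. Otherwise there will be a conflict.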
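The file-name check (no Image.jpg sitting next to an Image.png) can be sketched in a few lines of Python. This is only a rough helper, not part of kohya_ss: the function name `find_name_conflicts` and the extension list are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the dataset contains.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def find_name_conflicts(dataset_dir):
    """Return {folder/base_name: [file names]} for images sharing a base name in one folder."""
    by_stem = defaultdict(list)
    for path in sorted(Path(dataset_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            # Group by (folder, name without extension).
            by_stem[(path.parent, path.stem)].append(path.name)
    return {
        str(parent / stem): names
        for (parent, stem), names in by_stem.items()
        if len(names) > 1
    }
```

Calling `find_name_conflicts("/workspace/data/img_xyzminji")` (the dataset folder from the log above) should return an empty dict when there are no clashes.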

PENGUINADELIE commented 3 months ago

> Could you post your settings (the config file) so we can see if something is wrong there?
>
> Also, things you should check:
>
>   • Make sure there are no images with the same file name, i.e. check that Image.jpg and Image.png are not both present. Otherwise there will be a conflict.

Thank you so much for answering my question. I've put together an image of how I prepared the data and what I clicked to get this result. Do you have any idea what might be causing this? I'd be very grateful for an answer.

PENGUINADELIE commented 3 months ago

[Attached images 01 to 12]

5KilosOfCheese commented 3 months ago

Disable the image augmentations (crop, colour...); you can't use those while caching latents (except flip). Alternatively, disable caching of latents.
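As a rough illustration of that rule, here is a small check over a settings dict. The key names `cache_latents`, `color_aug`, `random_crop` and `flip_aug` are assumed to match the training config (only `cache_latents_to_disk`, `flip_aug` and `random_crop` actually appear in the traceback above), and `find_cache_conflicts` is a made-up helper, not a kohya_ss function.

```python
def find_cache_conflicts(config):
    """Return the enabled augmentations that cannot be combined with latent caching."""
    caching = config.get("cache_latents") or config.get("cache_latents_to_disk")
    if not caching:
        return []  # nothing is cached, so any augmentation is allowed
    # flip_aug is deliberately left out: flipping is the stated exception.
    incompatible = ("color_aug", "random_crop")
    return [name for name in incompatible if config.get(name)]
```

For example, `find_cache_conflicts({"cache_latents": True, "color_aug": True, "flip_aug": True})` reports only `color_aug`.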

b-fission commented 3 months ago

@PENGUINADELIE

You'll need to enable the No half VAE checkbox if you're getting the NaN detected error on SDXL.
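Nothing in this thread says why the checkbox helps. The usual explanation, taken here as an assumption, is that values inside the SDXL VAE can grow beyond what half precision (fp16) can hold, overflow to infinity, and then turn into NaN. A toy sketch of that mechanism using only the standard library (`to_fp16` is a helper invented for this example, and the numbers are arbitrary):

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to half precision, overflowing to +/-inf as fp16 arithmetic would."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        # The "e" struct format is IEEE 754 half precision.
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        # Magnitude is above the largest finite fp16 value (65504).
        return math.copysign(math.inf, x)


a = to_fp16(60000.0)      # still representable in fp16
total = to_fp16(a + a)    # 120000 does not fit, so it becomes inf
centered = total - total  # inf - inf is NaN
```

Keeping the VAE in full precision avoids the overflow step, so the later arithmetic never sees infinity.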

PENGUINADELIE commented 3 months ago

> Disable the image augmentations (crop, colour...); you can't use those while caching latents (except flip). Alternatively, disable caching of latents.

Thank you so much for your reply. You helped me solve the problem!

PENGUINADELIE commented 3 months ago

> @PENGUINADELIE
>
> You'll need to enable the No half VAE checkbox if you're getting the NaN detected error on SDXL.

Enabling the No half VAE checkbox worked well for me. Thank you so much for your answer.

choowkee commented 1 month ago

> @PENGUINADELIE
>
> You'll need to enable the No half VAE checkbox if you're getting the NaN detected error on SDXL.

Helped me as well. The SDXL 1.0 base model was giving me errors when training a LoRA, but this solved it. Thanks!