AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: ValueError: cannot convert float NaN to integer when using embeddings #6891

Open · billium99 opened this issue 1 year ago

billium99 commented 1 year ago

Is there an existing issue for this?

What happened?

After training an embedding pt file successfully, and placing it in the embeddings directory, I now get "ValueError: cannot convert float NaN to integer" when attempting to generate an image using my embedding word in the prompt.

This is an M1 Mac Studio with 32 GB of RAM.

Steps to reproduce the problem

  1. Train on a folder of images (in my case there were 50 images). Here are the settings I used to train (the learn-rate schedule syntax is unpacked in the sketch after this list):

```json
{
    "datetime": "2023-01-16 23:25:01",
    "model_name": "v2-1_768-ema-pruned",
    "model_hash": "ad2a33c361",
    "num_of_dataset_images": 50,
    "num_vectors_per_token": 15,
    "embedding_name": "Robryde23",
    "learn_rate": "0.05:10,0.02:20,0.01:60,0.005:200,0.002:500,0.001:3000,0.0005",
    "batch_size": 22,
    "gradient_step": 2,
    "data_root": "/Volumes/LaCie/Robyns Wedding/Robyn Trainer Shots/New-Resized",
    "log_directory": "textual_inversion/2023-01-16/Robryde23",
    "training_width": 512,
    "training_height": 512,
    "steps": 3000,
    "clip_grad_mode": "disabled",
    "clip_grad_value": "0.1",
    "latent_sampling_method": "deterministic",
    "create_image_every": 50,
    "save_embedding_every": 50,
    "save_image_with_stored_embedding": true,
    "template_file": "/Users/williamhenderson/stable-diffusion-webui/textual_inversion_templates/custom_subject_filewords.txt",
    "initial_step": 5
}
```

  2. Ran the training and got a message that it finished successfully.

  3. Moved the .pt file to the stable-diffusion-webui/embeddings folder.

  4. Created a new prompt for a new image in txt2img, using my "trigger" word for the embedding.

  5. Generation fails and returns the error: "ValueError: cannot convert float NaN to integer".

  6. Creating an image without the embedding prompt text works; SD seems to be fully functional otherwise.
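For reference, here is a minimal sketch of how a learn-rate schedule string like the one in step 1 is read: comma-separated rate:until_step pairs, where each rate applies until the given step and the final rate (with no step) runs to the end of training. The parsing below is a paraphrase for illustration, not the webui's exact code:

```python
# schedule string from the training settings above
schedule = "0.05:10,0.02:20,0.01:60,0.005:200,0.002:500,0.001:3000,0.0005"

# split into (rate, until_step) pairs; the last entry has no end step
pairs = []
for part in schedule.split(","):
    rate, _, end = part.partition(":")
    pairs.append((float(rate), int(end) if end else None))

print(pairs)  # [(0.05, 10), (0.02, 20), ..., (0.0005, None)]
```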

What should have happened?

I believe it should have rendered a new image using my training data, triggered by the embedding word in the prompt.

Commit where the problem happens

ff6a5bcec1ce25aa8f08b157ea957d764be23d8d

What platforms do you use to access the UI?

MacOS

What browsers do you use to access the UI?

Google Chrome

Command Line Arguments

No, none

Additional information, context and logs

Log output from my failed image creation:

```
Textual inversion embeddings loaded(1): Robryde23-50
100%|███████████████████████████████████████████| 20/20 [00:49<00:00, 2.47s/it]
Total progress: 100%|███████████████████████████| 20/20 [00:34<00:00, 1.74s/it]
100%|███████████████████████████████████████████| 20/20 [00:24<00:00, 1.22s/it]
Total progress: 100%|███████████████████████████| 20/20 [00:19<00:00, 1.02it/s]
Error completing request
Arguments: ('task(stl5bm10q0zly1e)', 'a photograph of Robryde23-50', 'ugly, distorted face', 'None', 'None', 20, 0, False, False, 1, 1, 18, -1.0, -1.0, 0, 0, 0, False, 512, 768, False, 0.7, 2, 'Latent', 0, 0, 0, 0, False, False, False, False, '', 1, '', 0, '', True, False, False) {}
Traceback (most recent call last):
  File "/Users/williamhenderson/stable-diffusion-webui/modules/call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "/Users/williamhenderson/stable-diffusion-webui/modules/call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "/Users/williamhenderson/stable-diffusion-webui/modules/txt2img.py", line 52, in txt2img
    processed = process_images(p)
  File "/Users/williamhenderson/stable-diffusion-webui/modules/processing.py", line 479, in process_images
    res = process_images_inner(p)
  File "/Users/williamhenderson/stable-diffusion-webui/modules/processing.py", line 598, in process_images_inner
    c = get_conds_with_caching(prompt_parser.get_multicond_learned_conditioning, prompts, p.steps, cached_c)
  File "/Users/williamhenderson/stable-diffusion-webui/modules/processing.py", line 565, in get_conds_with_caching
    cache[1] = function(shared.sd_model, required_prompts, steps)
  File "/Users/williamhenderson/stable-diffusion-webui/modules/prompt_parser.py", line 205, in get_multicond_learned_conditioning
    learned_conditioning = get_learned_conditioning(model, prompt_flat_list, steps)
  File "/Users/williamhenderson/stable-diffusion-webui/modules/prompt_parser.py", line 140, in get_learned_conditioning
    conds = model.get_learned_conditioning(texts)
  File "/Users/williamhenderson/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 669, in get_learned_conditioning
    c = self.cond_stage_model(c)
  File "/Users/williamhenderson/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/williamhenderson/stable-diffusion-webui/modules/sd_hijack_clip.py", line 233, in forward
    embeddings_list = ", ".join([f'{name} [{embedding.checksum()}]' for name, embedding in used_embeddings.items()])
  File "/Users/williamhenderson/stable-diffusion-webui/modules/sd_hijack_clip.py", line 233, in <listcomp>
    embeddings_list = ", ".join([f'{name} [{embedding.checksum()}]' for name, embedding in used_embeddings.items()])
  File "/Users/williamhenderson/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 83, in checksum
    self.cached_checksum = f'{const_hash(self.vec.reshape(-1) * 100) & 0xffff:04x}'
  File "/Users/williamhenderson/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 80, in const_hash
    r = (r * 281 ^ int(v) * 997) & 0xFFFFFFFF
ValueError: cannot convert float NaN to integer
```
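For context, the failing const_hash in the traceback casts every element of the embedding vector to int, so a single NaN in the trained vector is enough to raise this error. A minimal sketch reproducing the failure; the tensor shape is illustrative, not taken from the file:

```python
import torch

# stand-in for a trained embedding whose loss diverged to NaN
# (shape is illustrative: 15 vectors, SD 2.x text-encoder width)
vec = torch.full((15, 1024), float("nan"))

# mirror of the webui's const_hash loop from the traceback:
# each element is cast to int, and the first NaN raises
r = 0
for v in vec.reshape(-1) * 100:
    r = (r * 281 ^ int(v) * 997) & 0xFFFFFFFF
# ValueError: cannot convert float NaN to integer
```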

FurkanGozukara commented 1 year ago

Training is also broken on the newest update.

vladmandic commented 1 year ago

Most likely the training process itself resulted in NaN values (as happens when you choose an extremely high learn rate). Yes, the webui should display better error messages when that happens.

Check the training log to confirm.
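Beyond the training log, one way to confirm is to inspect the saved embedding directly. A minimal sketch, assuming the webui's textual-inversion .pt layout (learned vectors stored under a string_to_param dict) and the reporter's file name:

```python
import torch

# path assumed from this report; adjust to your embeddings folder
data = torch.load("embeddings/Robryde23-50.pt", map_location="cpu")

# webui textual-inversion files keep the learned vectors in "string_to_param"
for token, tensor in data["string_to_param"].items():
    print(token, "contains NaN:", torch.isnan(tensor).any().item())
```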

FurkanGozukara commented 1 year ago

I started using commit d8f8bcb821fa62e943eb95ee05b8a949317326fe; for now, it works well.

popcornkiller1088 commented 1 year ago

I have the same issue. Any ideas?

billium99 commented 1 year ago

vladmandic was, I think, correct in my case. Automatic continues to work flawlessly except with my own trainings. I can even deploy others' trainings and use them just fine, but something about my process for creating my own .pt file remains my problem. I didn't resolve it, and also haven't had time to start from scratch again. The first thing I'm going to try is fewer images: I was using 25, which doubled to 50 with the mirrored versions. I think 50 might be causing memory issues or something else I missed. I'm going to watch the logs more closely as well.

onexdata commented 1 year ago

I had this problem as well. I tried changing the prompts, changing the images, reducing the images, changing the epochs and learning rates, you name it; I always got the bug again and again, right before the preview of current progress was displayed, once the loss started coming out as NaN.

Then I realized I was training on a model OTHER THAN the Stable Diffusion default, i.e. when I switched back to 1.5 pruned ema-only, everything worked, again and again.

I went back to training on another model (I tried the Deliberate 2 model), and the NaN bug showed up again.

In short, if you're having this issue, check which model you have loaded; it's a likely cause.

maplethorpej commented 1 year ago

> In short, if you're having this issue, check which model you have loaded; it's a likely cause.

This fixed it for me! So easy to overlook. Thanks