AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: ValueError: cannot convert float NaN to integer #7162

Open NaughtDZ opened 1 year ago

NaughtDZ commented 1 year ago

Is there an existing issue for this?

What happened?

Training an embedding fails at steps such as 999 or 2999 with:

Traceback (most recent call last):
  File "I:\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 479, in train_embedding
    c = shared.sd_model.cond_stage_model(batch.cond_text)
  File "I:\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "I:\stable-diffusion-webui\modules\sd_hijack_clip.py", line 233, in forward
    embeddings_list = ", ".join([f'{name} [{embedding.checksum()}]' for name, embedding in used_embeddings.items()])
  File "I:\stable-diffusion-webui\modules\sd_hijack_clip.py", line 233, in <listcomp>
    embeddings_list = ", ".join([f'{name} [{embedding.checksum()}]' for name, embedding in used_embeddings.items()])
  File "I:\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 84, in checksum
    self.cached_checksum = f'{const_hash(self.vec.reshape(-1) * 100) & 0xffff:04x}'
  File "I:\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 81, in const_hash
    r = (r * 281 ^ int(v) * 997) & 0xFFFFFFFF
ValueError: cannot convert float NaN to integer
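The failure mode in the traceback can be reproduced in isolation: `const_hash` calls `int(v)` on every element of the flattened embedding vector, and `int()` raises this exact `ValueError` as soon as any weight has become NaN during training. A minimal sketch (the `const_hash` body mirrors the line shown in the traceback; the sample inputs are illustrative):

```python
import math

def const_hash(a):
    # Same arithmetic as the line in the traceback:
    # int(v) raises ValueError for NaN elements.
    r = 0
    for v in a:
        r = (r * 281 ^ int(v) * 997) & 0xFFFFFFFF
    return r

const_hash([1.0, 2.0, 3.0])  # fine: all elements are finite

try:
    const_hash([1.0, math.nan])  # embedding weights went NaN
except ValueError as e:
    print(e)  # cannot convert float NaN to integer
```

So the checksum itself is only where the crash surfaces; the underlying problem is NaN values appearing in the embedding vector during training.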

Steps to reproduce the problem

  1. Go to embedding training
  2. Train for some number of steps
  3. The error appears

What should have happened?

Embedding training should complete without error.

Commit where the problem happens

602a1864b05075ca4283986e6f5c7d5bce864e11

What platforms do you use to access UI ?

Windows

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

No response

Additional information, context and logs

No response

thezveroboy commented 1 year ago

got the same at step 149 and then at step 449

NaughtDZ commented 1 year ago

After comparing https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/602a1864b05075ca4283986e6f5c7d5bce864e11 and https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/d8f8bcb821fa62e943eb95ee05b8a949317326fe (the latter works without error), the files in the error report compare as follows:

  - \modules\sd_hijack_clip.py: the two versions are identical
  - \venv\lib\site-packages\torch\nn\modules\module.py: large number of inconsistencies
  - \modules\textual_inversion\textual_inversion.py: large number of inconsistencies

So... the problem may come from a version change in torch or in textual_inversion?

popcornkiller1088 commented 1 year ago

got the same issue at step 199 tho

KhoaVo commented 1 year ago

Got the same error at step 199 as well

kareem613 commented 1 year ago

Same build 0cc0ee1bcb4c24a8c9715f66cede06601bfc00c8 installed on an Ubuntu (focal) machine and on Windows 11. I get this error only on the Linux machine. Using 3000 steps and this embedding learning rate schedule: 0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.002:500, 0.001:3000, 0.0005. It only errors when it gets to 0.001:3000.
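For readers unfamiliar with that schedule syntax: each "rate:step" pair applies that learning rate until the given step, and a trailing bare rate applies for the remainder. A hypothetical parser sketch (names and structure are illustrative, not the webui's actual implementation):

```python
def parse_schedule(text):
    """Parse 'rate:step, rate:step, ..., rate' into (rate, until_step) pairs.

    A pair without ':' is open-ended and applies to all remaining steps.
    Illustrative sketch only, not the webui's real parser.
    """
    pairs = []
    for part in text.split(','):
        if ':' in part:
            rate, step = part.split(':')
            pairs.append((float(rate), int(step)))
        else:
            pairs.append((float(part), None))  # open-ended final rate
    return pairs

def rate_at(pairs, step):
    # Return the first rate whose "until" bound covers this step.
    for rate, until in pairs:
        if until is None or step <= until:
            return rate
    return pairs[-1][0]

sched = parse_schedule("0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.002:500, 0.001:3000, 0.0005")
rate_at(sched, 5)     # 0.05 (within the first segment)
rate_at(sched, 2500)  # 0.001 (the segment where kareem613 sees the error)
rate_at(sched, 4000)  # 0.0005 (open-ended tail)
```

Under this reading, "errors when it gets to 0.001:3000" just means the crash happens somewhere in the step range 501 to 3000, which is consistent with the step counts other commenters report.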

kareem613 commented 1 year ago

Extra data point. I moved a 3070 from the linux machine to the Windows machine, and I now get the error on the Windows machine.

seanburles commented 1 year ago

I think I may have found a workaround for this on my Mac M1. I kept getting the error at step 499. I looked at "Save an image to log directory every N steps, 0 to disable" and "Save a copy of embedding to log directory every N steps, 0 to disable"; both were set to 500. I set them above the maximum steps so they wouldn't trigger, and so far so good. Still training, but past the usual error point.
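That workaround fits the traceback: the checksum (where `int(NaN)` crashes) is evaluated when the prompt's embeddings are logged around save steps, which is why the error lands just before each save interval (499, 999, 2999...). Disabling the saves hides the crash but not the NaN weights. A hedged sketch of a defensive fix (not the project's actual patch; plain floats stand in for the tensor):

```python
import math

def safe_checksum(values):
    """Checksum in the spirit of const_hash, but NaN-tolerant.

    Returns the sentinel string 'nan' for a corrupted embedding
    instead of raising ValueError. Illustrative sketch only.
    """
    if any(math.isnan(v) for v in values):
        return 'nan'  # signal corruption rather than crash mid-training
    r = 0
    for v in values:
        r = (r * 281 ^ int(v * 100) * 997) & 0xFFFFFFFF
    return f'{r & 0xffff:04x}'

safe_checksum([0.1, -0.2, 0.3])   # 4-hex-digit checksum
safe_checksum([0.1, float('nan')])  # 'nan'
```

Even with such a guard, a NaN embedding usually points at a training problem (e.g. too high a learning rate), so surfacing the corruption is more useful than silently continuing.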