AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Variable learning rate results in NaN errors #5936

Closed: vladmandic closed this issue 1 year ago

vladmandic commented 1 year ago

Is there an existing issue for this?

What happened?

PR #1795 added support for a variable learning rate, but when I try to use it for embedding training,
the schedule is identified correctly and each rate change is printed to the console, yet training still always results in NaN errors.

For example: learn_rate = "0.005:50, 0.001:100, 0.0005:500"

Training always ends with a NaN error at step 500 (the first "checkpoint"), regardless of which values I use:
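For readers unfamiliar with the schedule syntax: the string is a comma-separated list of "rate:step" pairs, where each learning rate applies until the given step. A minimal sketch of how such a string can be interpreted (this is an illustration only; `parse_learn_rate_schedule` is a hypothetical helper, not the web UI's actual parser):

```python
def parse_learn_rate_schedule(schedule: str):
    """Return a list of (learning_rate, end_step) pairs.

    Hypothetical helper for illustration; the web UI has its own parser.
    """
    pairs = []
    for part in schedule.split(","):
        # Each part looks like "0.005:50"; a bare rate with no step is
        # treated as "apply until the end" (end_step = None).
        rate, _, step = part.strip().partition(":")
        pairs.append((float(rate), int(step) if step else None))
    return pairs

print(parse_learn_rate_schedule("0.005:50, 0.001:100, 0.0005:500"))
# [(0.005, 50), (0.001, 100), (0.0005, 500)]
```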

[Epoch 12: 20/40] loss: nan:  25% | 499/2000 [08:39<25:13,  1.01s/it]
Traceback (most recent call last):
  File "/home/vlado/branches/automatic/modules/textual_inversion/textual_inversion.py", line 397, in train_embedding
    processed = processing.process_images(p)
  File "/home/vlado/branches/automatic/modules/processing.py", line 464, in process_images
    res = process_images_inner(p)
  File "/home/vlado/branches/automatic/modules/processing.py", line 557, in process_images_inner
    c = prompt_parser.get_multicond_learned_conditioning(shared.sd_model, prompts, p.steps)
  File "/home/vlado/branches/automatic/modules/prompt_parser.py", line 203, in get_multicond_learned_conditioning
    learned_conditioning = get_learned_conditioning(model, prompt_flat_list, steps)
  File "/home/vlado/branches/automatic/modules/prompt_parser.py", line 138, in get_learned_conditioning
    conds = model.get_learned_conditioning(texts)
  File "/home/vlado/branches/automatic/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 669, in get_learned_conditioning
    c = self.cond_stage_model(c)
  File "/home/vlado/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vlado/branches/automatic/modules/sd_hijack_clip.py", line 184, in forward
    batch_multipliers, remade_batch_tokens, used_custom_terms, hijack_comments, hijack_fixes, token_count = self.process_text(text)
  File "/home/vlado/branches/automatic/modules/sd_hijack_clip.py", line 102, in process_text
    remade_tokens, fixes, multipliers, current_token_count = self.tokenize_line(line, used_custom_terms, hijack_comments)
  File "/home/vlado/branches/automatic/modules/sd_hijack_clip.py", line 77, in tokenize_line
    used_custom_terms.append((embedding.name, embedding.checksum()))
  File "/home/vlado/branches/automatic/modules/textual_inversion/textual_inversion.py", line 52, in checksum
    self.cached_checksum = f'{const_hash(self.vec.reshape(-1) * 100) & 0xffff:04x}'
  File "/home/vlado/branches/automatic/modules/textual_inversion/textual_inversion.py", line 49, in const_hash
    r = (r * 281 ^ int(v) * 997) & 0xFFFFFFFF
ValueError: cannot convert float NaN to integer

Using a fixed learning rate such as learn_rate = "0.005" works without issues.
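The traceback bottoms out in const_hash, where int(v) is applied to each embedding value: once the vector contains NaN, Python's int() raises exactly the ValueError reported above. The following sketch demonstrates the failure mode and one possible (purely illustrative, not the repo's actual fix) guard that surfaces the divergence with a clearer message:

```python
import math

# Python's int() refuses NaN; this is the exact error in the traceback.
try:
    int(float("nan"))
except ValueError as e:
    print(e)  # cannot convert float NaN to integer

# Hypothetical guard: detect NaN in the embedding values before hashing,
# so training fails with an explicit "diverged" error instead of a
# confusing conversion error deep inside checksum().
def checked_values(values):
    for v in values:
        if math.isnan(v):
            raise RuntimeError("embedding contains NaN; training has diverged")
        yield int(v)

print(list(checked_values([1.0, 2.5, -3.0])))  # [1, 2, -3]
```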

Steps to reproduce the problem

  1. Go to Train -> Train -> Train Embedding
  2. Change Embedding Learning Rate field to anything that includes variable rate
  3. Press Train Embedding
  4. Wait until step 500

What should have happened?

Training should complete without errors.

Commit where the problem happens

685f9631b56ff8bd43bce24ff5ce0f9a0e9af490

What platforms do you use to access UI ?

Windows, Linux

What browsers do you use to access the UI ?

Google Chrome, Microsoft Edge

Command Line Arguments

No response

Additional information, context and logs

No response

vladmandic commented 1 year ago

Something is not right in my analysis of what leads to the error; I'll refile.