AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Training finished at 0 steps. #12031

Closed hansun11 closed 1 year ago

hansun11 commented 1 year ago

Is there an existing issue for this?

What happened?

When I try to train, it doesn't work. The web UI shows: Training finished at 0 steps. Embedding saved to E:\SDwebui\webui\embeddings\smartlock.pt (It's a new PC, and my GPU is a 4060 Ti 16 GB.)

Steps to reproduce the problem

  1. Create a new embedding
  2. Choose the image set and start training with these settings: Shuffle tags by ',' when creating prompts; Drop out tags when creating prompts: 0.1; deterministic; Move VAE and CLIP to RAM when training if possible (saves VRAM); Save textual inversion and hypernet settings to a text file whenever training starts; Use cross attention optimizations while training

What should have happened?

It should work, as it did when I trained on Colab.

Version or Commit where the problem happens

version: v1.5.0

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Windows

What device are you running WebUI on?

Nvidia GPUs (RTX 20 above), Other GPUs

Cross attention optimization

Automatic

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

set COMMANDLINE_ARGS=--xformers

List of extensions

None

Console logs

Training at rate of 0.004 until step 30000
Preparing dataset...
100%|██████████████████████████████████████████████████████████████████████████████████| 21/21 [00:02<00:00,  9.14it/s]
  0%|                                                                                        | 0/30000 [00:00<?, ?it/s]*** Error training embedding
    Traceback (most recent call last):
      File "E:\SDwebui\webui\modules\textual_inversion\textual_inversion.py", line 529, in train_embedding
        scaler.scale(loss).backward()
      File "E:\SDwebui\system\python\lib\site-packages\torch\_tensor.py", line 487, in backward
        torch.autograd.backward(
      File "E:\SDwebui\system\python\lib\site-packages\torch\autograd\__init__.py", line 200, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 77, 768]], which is output 0 of MulBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
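The hint at the end of that traceback refers to torch.autograd.set_detect_anomaly(True). A minimal standalone sketch (plain PyTorch, not taken from the webui code) of the same class of failure, with anomaly detection enabled so the backward error also reports the forward operation that produced the offending tensor:

    import torch

    torch.autograd.set_detect_anomaly(True)  # adds a forward-pass traceback to backward errors

    x = torch.randn(1, 77, 768, requires_grad=True)
    y = x * 2            # y is "output 0 of MulBackward0"
    z = y ** 2           # the pow op saves y for its backward pass
    y += 1               # in-place edit bumps y's version counter after it was saved
    z.sum().backward()   # raises: "... has been modified by an inplace operation ..."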

---

Additional information

No response

jbenton commented 1 year ago

Yep, training embeddings seems to have been broken in 1.5.0. I've gotten that same error —

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 77, 768]], which is output 0 of MulBackward0, is at version 2; expected version 1 instead.

— a half-dozen times in the past 24 hours across multiple A1111 installs, on both Colab (NoCrypt) and a different hosted A1111 (Stadio). Reverting to 1.4.0 fixes the problem.
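For a local git-based install, the same downgrade can be done from inside the webui folder by checking out the 1.4.0 release tag (assuming the tag is available on your remote):

    git fetch --tags
    git checkout v1.4.0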

hyppyhyppo commented 1 year ago

Same error when training TI with 1.5.0; I switched back to 1.4.1.

hansun11 commented 1 year ago

Yes, training embeddings seems to have been broken in 1.5.0. I ran into the same error too —

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 77, 768]], which is output 0 of MulBackward0, is at version 2; expected version 1 instead.

— half a dozen times in the past 24 hours, across multiple A1111 installs, on both Colab (NoCrypt) and a different hosted A1111 (Stadio). Reverting to 1.4.0 fixes the problem.

Unfortunately, it still doesn't work after I switched back to 1.4; the log shows the same error.

hansun11 commented 1 year ago

This bug is fixed in the latest RC version by commit https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/5c8f91b22975701af22d24f947af82e7d23264d5
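On a git-based install, updating to that RC will pick this up; alternatively, a rough way to apply just this one commit to your current checkout (assuming it is reachable from your remote) is:

    git fetch origin
    git cherry-pick 5c8f91b22975701af22d24f947af82e7d23264d5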

thomashooo commented 1 year ago

This bug is fixed in the latest RC version by commit 5c8f91b

Awesome, after a whole day of struggling with this, the problem is solved!!!