**Open** · f-rank opened this issue 1 year ago
Turning off "Move VAE and CLIP to RAM when training if possible. Saves VRAM." in Settings > Training made it run once, but on the second and subsequent runs it went back to throwing the same errors.
Setting both "Save an image to log directory every N steps, 0 to disable" and "Save a copy of embedding to log directory every N steps, 0 to disable" to 1 actually runs, but that is unusable. It is also very counterintuitive, since with the 100 steps I had in there before, training never even reached the image-generation part. So I'm lost on this.
It doesn't seem to be a RAM issue; I have 12 GB of RAM here and I get the same error.
Ah, it's an old bug. Check your prompt template txt, it does need `[name]` in every line.
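For reference, a template where every line contains the placeholder would look something like this (the lines below are just illustrative; `[filewords]` is only needed if your template also pulls in per-image captions):

```
a photo of a [name], [filewords]
a rendering of a [name], [filewords]
a close-up photo of a [name], [filewords]
```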
> Ah, it's an old bug. Check your prompt template txt, it does need `[name]` in every line.
Thank you Heathen. I ran into the same issue just today, and that fixed it. I didn't know `[name]` was required for embeddings; I thought you could just use another token and it would work as long as you had the filename in the txt2img prompt, but apparently not.
> Ah, it's an old bug. Check your prompt template txt, it does need `[name]` in every line.
I did; none.txt didn't have it, so I added it in. Thing is, I wasn't even using none.txt, and the problem continues. I can start a training run if I set both Save image and Save copy of embedding to 1. At around 100 steps I can interrupt, change those numbers back, and hit Train again; then it runs without throwing.
I have the same problem, however style_filewords.txt already contains `[name]` on every line, and setting both the image and embedding save settings to 1 threw the same assertion. Does anyone have insight into what could be causing this? I'm not sure what `[name]` in every line of the template file, or the embedding save settings, have to do with being unable to "record inf checks for this optimizer".
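Looking at the asserting code, it comes from PyTorch's AMP GradScaler: `scaler.step()` asserts that at least one inf/NaN check was recorded while unscaling gradients, and checks are only recorded for parameters that actually received gradients. So if `backward()` never runs, or the loss ends up detached from the trainable embedding (which seems to be what happens in this bug), the assertion fires. A minimal sketch of the failure mode, assuming a CUDA device:

```python
import torch

model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

loss = model(torch.randn(1, 4, device="cuda")).sum()
scaler.scale(loss)  # initializes the scaler, but...
# ...backward() is never called, so no parameter has a .grad,
# no inf checks get recorded, and the next line raises:
#   AssertionError: No inf checks were recorded for this optimizer.
scaler.step(optimizer)
```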
Hi @ljleb, please go through your style_filewords.txt file and all the files in that directory and delete the trailing extra line at the end of each file. Doing this seems to have fixed the problem on my side. I hope it helps you.
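If you have a lot of template files, something like this strips the trailing blank lines in bulk (a sketch; `textual_inversion_templates` is the webui's default template directory, so adjust the path if yours differs):

```python
from pathlib import Path

# Remove trailing blank lines from every prompt template,
# keeping exactly one final newline.
for f in Path("textual_inversion_templates").glob("*.txt"):
    text = f.read_text(encoding="utf-8").rstrip("\n") + "\n"
    f.write_text(text, encoding="utf-8")
```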
> please go through all the files in that directory and delete the trailing extra line at the end of each file.
Thanks for the suggestion. I tried it, but unfortunately it didn't fix or change the assertion error in any way. Something else I tried was commenting out the assertion, but then a similar assertion is triggered in another part of the code, which makes me believe something must be wrong with the inputs to the asserting code.
I had hoped it would work for your problem too; apparently it's not related to trailing lines in your case.
I finally fixed the problem by downgrading from torch 1.13 to torch 1.12.x and not passing --reinstall-torch as advertised in stdout when starting the webui. I'm not sure why using the intended torch version breaks training. I'm using an RTX 3080 (10 GB VRAM), 32 GB RAM, and an Intel i5-12600KF.
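For anyone who wants to try the same downgrade, a command along these lines should work from the webui folder (the cu113 index matches the `+cu113` build shown below, so pick the index for your CUDA version; torchvision 0.13.1 is the release paired with torch 1.12.1):

```
.\venv\Scripts\python.exe -m pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
```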
Edit: to be completely transparent about the torch version that worked for me, here's the output of `pip show torch`:
```
PS D:\src\stable-diffusion-webui> .\venv\Scripts\python.exe -m pip show torch
Name: torch
Version: 1.12.1+cu113
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: d:\src\stable-diffusion-webui\venv\lib\site-packages
Requires: typing-extensions
Required-by: accelerate, basicsr, clean-fid, clip, facexlib, gfpgan, kornia, lpips, open-clip-torch, pytorch-lightning, realesrgan, timm, torchaudio, torchdiffeq, torchmetrics, torchsde, torchvision
```
### Is there an existing issue for this?

### What happened?

Set up the training just like usual; it starts to prepare the dataset, then throws:

```
File "D:\WORK\conda_envs\automatic\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 498, in train_embedding
    scaler.step(optimizer)
File "d:\WORK\conda_envs\automatic\stable-diffusion-webui\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 336, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.

Applying xformers cross attention optimization.
```

The web UI states: "Training finished at 0 steps."
### Steps to reproduce the problem

1. Create an embedding.
2. Go to the Train tab and fill in the necessary info.
3. Click Train Embedding.
### What should have happened?

It should have started the training process.
### Commit where the problem happens

82725f0ac439f7e3b67858d55900e95330bbd326
### What platforms do you use to access the UI?

Windows

### What browsers do you use to access the UI?

Google Chrome
### Command Line Arguments

### Additional information, context and logs