AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Bug]: Textual Inversion training fails on 8 GB vram when it used to work just fine on the same machine #6705

Open f-rank opened 1 year ago

f-rank commented 1 year ago

Is there an existing issue for this?

What happened?

Set up the training just like usual; it starts to prepare the dataset, then throws:

" File "D:\WORK\conda_envs\automatic\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 498, in train_embedding scaler.step(optimizer) File "d:\WORK\conda_envs\automatic\stable-diffusion-webui\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 336, in step assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer." AssertionError: No inf checks were recorded for this optimizer.

Applying xformers cross attention optimization. "

The web ui states: "Training finished at 0 steps."
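For context, torch's GradScaler.step() raises this assertion when none of the optimizer's parameters carry a gradient at step time, which is what happens when the training loop never completes a backward pass (hence "Training finished at 0 steps"). A minimal sketch that reproduces the same error, assuming torch ~1.13 on a CUDA machine; the linear model and SGD optimizer are illustrative stand-ins, not what the webui actually trains:

import torch

# Illustrative stand-ins; the webui optimizes an embedding tensor instead.
model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

loss = model(torch.randn(2, 4).cuda()).sum()
scaler.scale(loss)  # initializes the scaler, but .backward() is never called

# With no backward pass, no parameter has a .grad, so unscale_() records no
# inf/NaN checks and step() trips the assertion quoted in the report:
scaler.step(optimizer)
# AssertionError: No inf checks were recorded for this optimizer.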

Steps to reproduce the problem

Create an embedding, go to the Train tab, and fill in the necessary info. Click Train Embedding.

What should have happened?

It should have started the training process.

Commit where the problem happens

82725f0ac439f7e3b67858d55900e95330bbd326

What platforms do you use to access the UI?

Windows

What browsers do you use to access the UI?

Google Chrome

Command Line Arguments

--xformers --opt-channelslast

Additional information, context and logs

f-rank commented 1 year ago

Turning off "Move VAE and CLIP to RAM when training if possible. Saves VRAM." in Settings>Training made it run once but on the second run and subsequent tries it went back to throwing the same errors.

f-rank commented 1 year ago

Setting both "Save an image to log directory every N steps, 0 to disable" and "Save a copy of embedding to log directory every N steps, 0 to disable" to 1, actually runs. But this is unusable. This is very counter intuitive as the 100 steps I had in there before didn't make it to the image generation part. So, I'm lost on this.

Heathen commented 1 year ago

Doesn't seem to be a RAM issue; 12 GB of RAM here and I get the same error.

Heathen commented 1 year ago

Ah, it's an old bug. Check your prompt template txt; it does need [name] in every line.

trashcand69 commented 1 year ago

> Ah, it's an old bug. Check your prompt template txt; it does need [name] in every line.

Thank you, Heathen. I also hit the same issue today, and that fixed it. I didn't know [name] was required for embeddings. I thought you could just use another token and it would work as long as you had the filename in the txt2img prompt, but apparently not.
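For reference, [name] is substituted with the embedding's name and [filewords] with the per-image caption when the template is expanded during training. The stock templates in the webui's textual_inversion_templates directory use lines of roughly this shape (an illustrative excerpt, not a verbatim copy; check your local style_filewords.txt):

a painting of [filewords], art by [name]
a rendering of [filewords], art by [name]
a cropped painting of [filewords], art by [name]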

f-rank commented 1 year ago

> Ah, it's an old bug. Check your prompt template txt; it does need [name] in every line.

I did; none.txt didn't have it, so I added it in. Thing is, I wasn't even using none.txt, and the problem continues. I am able to get a training run started if I set both Save image and Save copy of embedding to 1. At around 100 steps or so I am able to interrupt, change those numbers, and then Train again; it then runs without throwing.

ljleb commented 1 year ago

I have the same problem; however, style_filewords.txt already contains [name] on every line, and setting both the img and embed save settings to 1 threw the same assertion. Does anyone have insight into what could be causing this? I'm not sure what [name] in every line of the template file and the embed save settings have to do with being unable to "record inf checks for this optimizer".

f-rank commented 1 year ago

> I have the same problem; however, style_filewords.txt already contains [name] on every line, and setting both the img and embed save settings to 1 threw the same assertion. Does anyone have insight into what could be causing this? I'm not sure what [name] in every line of the template file and the embed save settings have to do with being unable to "record inf checks for this optimizer".

Hi @ljleb, please go through your style_filewords.txt file and all the files in that directory and delete the trailing extra line at the end of each file. Doing this seems to have fixed the problem on my side. I hope it helps you.
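For anyone with many template files, a minimal cleanup sketch along those lines (the textual_inversion_templates path is the webui default; point it at wherever your templates actually live):

from pathlib import Path

# Assumed default template directory; adjust for your own install.
templates = Path("textual_inversion_templates")

for f in templates.glob("*.txt"):
    # Drop trailing blank lines and keep exactly one final newline.
    f.write_text(f.read_text(encoding="utf-8").rstrip("\n") + "\n", encoding="utf-8")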

ljleb commented 1 year ago

> please go through all the files in that directory and delete the trailing extra line at the end of each file.

Thanks for the suggestion. I tried it, but unfortunately it didn't fix or change the assertion error in any way. Something else I tried was commenting out the assertion, but then a similar assertion is triggered in another part of the code, which makes me believe something must be wrong with the inputs reaching the asserting code.

f-rank commented 1 year ago

I had hoped it would work for your problem too; apparently it's not related to trailing lines in your case.

ljleb commented 1 year ago

I finally fixed the problem by downgrading from torch 1.13 to torch 1.12.x and not passing --reinstall-torch as advertised in stdout when starting the webui. Not sure why using the intended torch version breaks training. I'm using an RTX 3080 with 10 GB VRAM, 32 GB RAM, and an Intel i5-12600KF.

Edit: to be completely transparent about the torch version that worked for me, here's the output of pip show torch:

PS D:\src\stable-diffusion-webui> .\venv\Scripts\python.exe -m pip show torch
Name: torch
Version: 1.12.1+cu113                                                              
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/                                                    
Author: PyTorch Team                                                               
Author-email: packages@pytorch.org                                                 
License: BSD-3                                                                     
Location: d:\src\stable-diffusion-webui\venv\lib\site-packages                     
Requires: typing-extensions
Required-by: accelerate, basicsr, clean-fid, clip, facexlib, gfpgan, kornia, lpips, open-clip-torch, pytorch-lightning, realesrgan, timm, torchaudio, torchdiffeq, torchmetrics, torchsde, torchvision
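For anyone trying the same downgrade, a command along these lines, run from the webui folder, should install the matching builds (the torchvision==0.13.1 pin is the usual companion release to torch 1.12.1, and the cu113 index URL follows PyTorch's standard wheel hosting; verify both against your own CUDA setup):

PS D:\src\stable-diffusion-webui> .\venv\Scripts\python.exe -m pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113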