AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

Not sure if it's a bug - textual inversion: using subject_filewords.txt instead of style_filewords.txt doesn't generate likeness at all #1579

Closed · barleyj21 closed this issue 1 year ago

barleyj21 commented 1 year ago

Not sure if it's a bug, but training for 8000 steps with "subject_filewords.txt" didn't produce anything remotely connected to the training data (using a 6-image dataset), while the default "style_filewords.txt" prompt template file started generating likeness from the first 500 steps (using a 7-image dataset). Both datasets were preprocessed with Automatic1111's preprocessor with the "flip" flag enabled before the run.
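For context, both prompt template files are plain text where each line is a prompt pattern: roughly speaking, [name] is replaced with the embedding's name and [filewords] with the per-image caption produced during preprocessing. The lines below only illustrate the two styles and are not verbatim copies of the shipped files; a style template uses [name] as the artist/style, while a subject template makes [name] the thing being depicted:

```text
a painting of [filewords], art by [name]
a photo of a [name], [filewords]
```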

Do I need to train longer, or use a larger dataset for a subject?

H1M4W4R1 commented 1 year ago

Note: subject == object.

It works quite well. I tried it with several different options (purging the embedding each time). The best one was with the full dataset (88 face images... EDIT: multiplied by 2 due to flip; I could actually have gotten more, but got too bored with cropping images) - 5000 steps, 0.05 learning rate, 6 vectors.

Images get about 80-90% accuracy, which I can quickly push to something like 95-99% with inpainting.

How-to

The most important thing is the number of vectors. I recommend around 6-12 for good accuracy and semi-good editability. Going lower will require far more steps (it increases editability but decreases accuracy). Going higher is the reverse: less editability, more accuracy.
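To make "vectors" concrete: textual inversion learns a small matrix of token embeddings for the new word while the whole SD model stays frozen, so the vector count is simply the number of embedding rows being optimized; more rows capture detail faster (accuracy) but behave less like a normal word in prompts (editability). A minimal PyTorch-style sketch, with illustrative shapes and names rather than the webui's actual code:

```python
import torch

embedding_dim = 768   # SD v1 text encoder token embedding size
num_vectors = 6       # the "vectors per token" setting discussed above

# The embedding is just num_vectors learnable rows; the SD model itself stays frozen.
embedding = torch.nn.Parameter(torch.randn(num_vectors, embedding_dim) * 0.01)
optimizer = torch.optim.AdamW([embedding], lr=0.05)  # the learning rate used above

# Sketch of the loop: splice `embedding` into the prompt's token embeddings
# wherever the placeholder word appears, run the frozen model's noise-prediction
# loss, and update only `embedding`.
# for batch in dataloader:
#     loss = diffusion_loss(frozen_sd_model, batch, embedding)  # hypothetical helper
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
```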

The second important thing: the images in your dataset must be similar to the ones your model was trained on. If you use anime images with a real-world model, your embedding will easily get messed up.

For much better results go to 25k steps. That is quite overkill and depends on image complexity, which affects SD quite heavily. E.g. outfits: teaching a single outfit element on cropped images (only that element and part of the body visible) gets good results in 5-10k steps with 6 vectors, while an entire outfit will need something like 100-500k steps with 24 vectors. It's better to cut the outfit into specific elements and then prompt them together. Also, ignore things like eye color and hair color in your filewords; they usually do more harm than good. I commonly don't use filewords at all.
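If you do keep filewords but want to drop attributes like eye or hair color, one low-effort option is to clean the caption .txt files that preprocessing writes next to each image. A rough sketch, where the dataset path and tag list are assumptions for illustration:

```python
import pathlib

dataset_dir = pathlib.Path("training/my_subject")  # hypothetical preprocessed dataset folder
# Tags assumed to do more harm than good for a subject embedding.
banned = ("blue eyes", "green eyes", "brown eyes",
          "blonde hair", "black hair", "brown hair", "red hair")

for caption_file in dataset_dir.glob("*.txt"):
    tags = [t.strip() for t in caption_file.read_text().split(",")]
    kept = [t for t in tags if not any(b in t.lower() for b in banned)]
    caption_file.write_text(", ".join(kept))
```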

Sometimes it's better to define objects in the template as e.g. 'girl wearing [name]'. This gives context to the AI: it knows the element is part of clothing, so it looks for it on characters and leaves the environment alone. It does create a problem later during prompting, but it's easier to learn how to prompt around it than to wait for a good embedding.
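As an illustration of giving that context, a custom prompt template for a clothing embedding might contain lines like these (save them as your own template file and point the training tab at it; the wording is an example, not one of the shipped templates):

```text
a photo of a girl wearing [name]
a full body shot of a girl wearing [name], [filewords]
a close-up photo of a girl wearing [name]
```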

Technical things

Teaching a style is easier because it's about the "likeness" of the entire image; an object/subject is much harder, because the AI needs to distinguish it from other objects. Things get even harder when two objects have the same or very similar names but a different shape or location. E.g. you can wear a scarf on your neck or a scarf over your mouth (during winter). My model already had the first option defined, but the second one did not... even 100k steps at my usual settings were too few...

Fewer vectors require more steps (to get accurate results), because you have far more possible outputs for the same number of inputs; thus you also need more images. The reverse holds for higher vector counts.

Shortcut

If you want a similar overall image, go for style. If you want a specific element in the image (clothing, jewellery, etc.), go for subject. I also recommend editing the templates to give context about "what it is" (clothing, environment, etc.).
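Once separate element embeddings are trained, they are combined at inference time simply by putting their names in one prompt, for example (hypothetical embedding names):

```text
a photo of a girl wearing my-jacket and my-scarf, standing in a snowy street
```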

I recommend small vector counts for describing generics like a character's sex, a piece of furniture, etc. In those cases you need editability, and sometimes the creativity of the AI is useful. For very specific things, like a specific model of GPU, go with a higher vector count (because you have fewer images of e.g. a 'Tesla K20X', and it's more specific than 'girl').

Appendix

I work with the Waifu model on an RTX 2070 Super, Ryzen 9 3900X, 64GB DDR4 3600MHz. Sometimes I also turn on my K20X when I need some more power, but the K20X has quite low bandwidth, so it's mostly for when I need to run some engineering software while training the AI.

barleyj21 commented 1 year ago

Thank you! You've explained this really well! I've been training everything I mentioned in this issue with 10 vectors, but I've only given it a couple of attempts so far. Now that, thanks to you, I'm sure it's working, I'm going to try with more training data and more steps, vectors, etc.

With the default Stable Diffusion model, there are historical figures that don't look the way they are depicted in photos and drawings (so it seems the training dataset didn't have access to the representations I'm used to). I've managed to greatly improve results with just 5000 steps using the default style_filewords.txt and a 6-image set, so now I want to do some extensive testing with one subject and various approaches.