CompVis / stable-diffusion

A latent text-to-image diffusion model
https://ommer-lab.com/research/latent-diffusion-models/

Documentation request: Text prompt primer #222

Open MrDrMcCoy opened 2 years ago

MrDrMcCoy commented 2 years ago

One of the great difficulties for new users trying to coax AI image generators into producing something like what they imagine is constructing the text prompt. Users are often told that they can simply describe what they want to see and the model will produce it. In my experience, many of the phrases I put into prompts are either ignored or misunderstood. I suspect this is partially my own fault, and that a bit of documentation would improve the situation.

What I'm looking for is a document that details the following:

  1. What phrases are understood for artistic styling? For example, would it understand things like pixel art, line drawing, comic book, pulp art, cad model, salvador dali, or solarpunk?
  2. What phrases are understood for characters and objects? For example, would it understand things like garden gnome, maelstrom, mineral vein, power armor, coat of arms, or soldering iron?
  3. What phrases are understood for verbs and modifiers? For example, would it understand things like opening, fallow, holding, jaundiced, ugly, angry, vibrating, dutch angle, or defenestrating?
  4. What phrases are understood for image output settings? For example, would it understand things like 16:9, UHD, or 5-bit color?
  5. Is there any significance to grammar or ordering of phrases?
  6. What are the practical limits of how many and how specific one's phrases might be?
  7. Are there any hidden modifier phrases that the processing engine watches for?
  8. What happens when you repeat phrases? For example, woman shining a flashlight in an alley, but the flashlight shines darkness instead of light.
  9. What grammar or phrases will be ignored by the processing engine?
  10. Are there any grammatical patterns that tend to lead to better results?
  11. Other tips and tricks for how to talk to the machine.
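Until such a document exists, question 5 (ordering) can at least be probed empirically. The snippet below is a minimal sketch, not part of this repository: the helper name `order_variants` is hypothetical, and it only builds prompt strings that contain the same phrases in different orders. Running each variant through your own generator with a fixed seed is left to the reader.

```python
from itertools import permutations


def order_variants(phrases, limit=6):
    """Return up to `limit` prompts containing the same phrases in different orders.

    Hypothetical helper for illustration only; it does not call any image model.
    """
    variants = []
    for ordering in permutations(phrases):
        # Join with ", " so that the separator stays constant and only order varies.
        variants.append(", ".join(ordering))
        if len(variants) == limit:
            break
    return variants


if __name__ == "__main__":
    for prompt in order_variants(["garden gnome", "pixel art", "dutch angle"]):
        print(prompt)
```

Keeping the separator and the seed constant means any difference between the resulting images can be attributed to phrase order rather than to other changes in the prompt.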

As with all current iterations of natural language processing, the engine's ability to interpret what we write will fall well short of what humans can do. Therefore, humans need to know the boundaries of what the system can interpret so that we can talk to the machine in terms it will understand. Hopefully a document that details these things would improve the usability, quality, and utility of tools like this.

johnbcliff commented 1 year ago

I would love to know about punctuation and syntax, and their absence. We know that ( ) [ ] and : are used, but how important is a comma? We can obviously experiment with that. How about periods? I see things like f 1 5 for f 1.5, 200 mm vs 200mm, and so forth; are they parsed the way users think they are? For compound names like "garden gnome" we can look at word lists, test against CLIP, etc., but it would be nice to have a bit more clarity. My big question today is: does an ampersand work as well as 'and' for things specified that way? Is the character '&' meaningful? I can test (and am), but even a difference of a couple of characters in a prompt can change the output, and the output doesn't tell me how the prompt was parsed. Interrogate just confuses me : )
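One way to see how spellings like 200 mm vs 200mm might be split is to imitate the first stage of the CLIP tokenizer. The sketch below is an ASCII-only approximation of the chunking step that, as far as I can tell, the reference CLIP tokenizer applies before its byte-pair merges: lowercase the text, collapse whitespace, then split into runs of letters, single digits, and runs of punctuation. The function name `pre_tokenize` is hypothetical, the real tokenizer uses Unicode character classes, and the BPE merge step is not reproduced here, so treat the output as a rough guide rather than the actual token list.

```python
import re

# Assumption: approximates the CLIP pre-tokenization pattern using ASCII classes.
# The real tokenizer uses Unicode classes and then applies BPE inside each chunk.
_CHUNK = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d|[a-z]+|[0-9]|[^\sa-z0-9]+")


def pre_tokenize(prompt):
    """Split a prompt into the chunks that BPE would later operate on."""
    # Lowercase and collapse all whitespace runs to single spaces.
    text = " ".join(prompt.lower().split())
    return _CHUNK.findall(text)


if __name__ == "__main__":
    for prompt in ["200 mm", "200mm", "f 1 5", "f 1.5",
                   "cats & dogs", "cats and dogs"]:
        print(repr(prompt), "->", pre_tokenize(prompt))
```

Under these assumptions, 200 mm and 200mm produce the same chunks (each digit on its own, then mm), f 1.5 differs from f 1 5 only by a separate "." chunk, and "&" and "and" are kept as distinct chunks. Whether the model treats those chunks as meaningfully different is something this sketch cannot show.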

bezuklada commented 1 year ago

Hi! Have you received answers to these questions?