When I try to generate images of a scene, it produces the wrong number of people, distorts their appearance, duplicates individuals, and omits finer details like certain objects.
Note: It’s easier to fix aspects of the setting (provided you don’t need it to be highly detailed) than to fix characters, which ranges from difficult to impossible. Adjusting skin tone to unconventional colours like grey is very difficult.
Note: You can ask GPT to help you modify the prompt to emphasise or fix certain parts of the image, but there doesn’t seem to be a way to make Dall E keep the other aspects of the image the same. Changing one thing often alters things you wanted to stay the same, and adding more details can confuse Dall E. Certain specific details (e.g. a woman carrying an older woman) are not always present in the image.
This whole process has taken me over an hour ⇒ not a good turnaround time when I need to write!
It’s time-consuming to move back and forth between GPT-4 (which I ask to edit my prompt to make specific changes to the image) and Dall E 3.
Positives
I was able to copy-paste the text (which includes dialogue, narration, and description) from my story and use this prompt with GPT to get a prompt for Dall E: “Convert the following scene from a book I'm writing into a text description for Dall E.”
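The conversion-and-revision workflow above can be sketched as two small helpers. This is a hedged sketch: the function names and the exact revision wording are my own assumptions, and in practice the returned strings would be sent to GPT-4 as chat messages (with its reply then pasted into Dall E).

```python
# Illustrative helpers for the GPT -> Dall E prompt workflow.
# The names and revision wording are assumptions, not a tested recipe.

def scene_to_prompt_instruction(scene_text: str) -> str:
    """Wrap a scene excerpt in the conversion instruction quoted above."""
    return (
        "Convert the following scene from a book I'm writing into a "
        "text description for Dall E.\n\n" + scene_text
    )

def revision_instruction(change: str, keep: list[str]) -> str:
    """Ask for one targeted change while listing aspects to hold fixed.
    (Dall E may still alter the 'kept' aspects; this only nudges it.)"""
    kept = "; ".join(keep)
    return (
        f"Revise the image description to {change}, "
        f"but keep these aspects exactly the same: {kept}."
    )
```

For example, `revision_instruction("make the sky stormier", ["the four characters", "the gold tablet"])` yields one targeted revision request; as noted above, there is no guarantee Dall E actually preserves the listed aspects.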
Generating the image prompt with GPT is faster than writing it completely from scratch, and sometimes the edits I request result in an image closer to what I want.
Some of the details in these images are useful for inspiration - the backdrop or a particular character’s appearance - but if you already have specific ideas in mind for how things and people should look, they won’t be useful.
Recap on using Dall E 3 for generating complex scene
Can Dall E handle multiple people and render them correctly? No.
Can Dall E manage multiple details in the scene and render them accurately (and to what extent)? No. Dall E 3 has trouble keeping track of people - there are often 5 individuals instead of 4.
It also gets confused about clothing and ethnicity even when these are specified in the prompt. If the ethnicity is specified as Indian, it’ll generate people wearing traditional Indian attire (saree, lungi, etc.) or other non-Western clothing, so it leans towards stereotypes and assumptions.
It isn’t able to keep track of details of objects like the gold tablet.
The background is the most useful part of the image, and the one character in the forefront is the clearest and best rendered.
If you ask it to make the image more fantastical, it will drop some other detail, such as rendering the correct sword tattoo (producing random tattoos instead). More details ⇒ more confusion.
It’s easier to work with this if you have a more vague idea of your character’s appearance and are willing to alter your original vision - if you want a very faithful depiction, you won’t get it!
Recap on using SDXL with GPT-4 and Claude for generating complex scene
The results were comparably bad: SDXL (in Poe’s interface) only accepts short prompts, so I couldn’t add the necessary detail, and it also missed many of the details that were included in the short prompt. The resulting images were too far off from what I had in mind, so SDXL technically performed even worse than Dall E 3 (although I wasn’t inclined to use Dall E 3’s output either).
Recap on using Dall E 3 vs SDXL with GPT-4 for generating image of setting:
Pretty satisfied with the results for generating setting images (with no people specified)! This is definitely better than making images of characters and, I think, equivalent to making images of objects.
Would be helpful if the UI let the writer view multiple variations of the image, because the first or second one isn’t necessarily the best, even with the same prompt!
For SDXL, you need to emphasise that the image should be high quality, and even then I didn’t think the images were sufficiently detailed; the depiction of the interior didn’t quite match what I had in mind.
Dall E 3 had far better results for generating setting images compared to SDXL – the quality, the level of detail, the usefulness, etc.
Definitely better to use more specific prompts and to develop these using the LLM before putting them into Dall E.
Can make diverse characters using Dall E – e.g. they have scars, etc. It does help to ask GPT-4 to alter the prompt to emphasise that the figure should be a woman when depicting a fisherwoman or pirate - some of the initially generated images depicted men.
Problem for making a storyboard: there will be no consistency across images of a character. This matters more for a storyboard, because a moodboard can be far more eclectic.
Very good summary! I imagine writers would have similar issues when using these models. Addressing some of these issues could be a goal of your tool.