YuxinWenRick / hard-prompts-made-easy


Questions around running this to get more usable prompts #3

markojak closed this issue 1 year ago

markojak commented 1 year ago

This is such an awesome project, thanks for building it. I'm trying to figure out how I would go about reverse engineering an intricate photorealistic portrait like this image.

If I run this as-is, I get: best cosine sim: 0.4274442791938782, best prompt: beatrice wolfdgers haircreative oirswolivanka
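(For context, that cosine sim is the CLIP image-text similarity between the learned prompt and the target image. A rough sketch of checking it independently, assuming the default OpenCLIP ViT-H-14 model from the sample config and a placeholder image path:)

```python
import open_clip
import torch
from PIL import Image

# Sketch of the reported "cosine sim": similarity between the CLIP embedding of the
# learned prompt and the CLIP embedding of the target image. The ViT-H-14 /
# laion2b_s32b_b79k choice is an assumption about the default config, and
# "target.jpg" is a placeholder for the portrait.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

image = preprocess(Image.open("target.jpg")).unsqueeze(0)
text = tokenizer(["beatrice wolfdgers haircreative oirswolivanka"])

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    print((img_feat * txt_feat).sum(-1).item())  # should land near the ~0.43 reported above
```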

And the images that it outputs are https://share.cleanshot.com/GNsS4hJ9

You mentioned additional steps to figure out the optimal prompt. I don't mind training further if it can surface counter-intuitive keywords that produce the kind of output we're after.

Thoughts?

YuxinWenRick commented 1 year ago

Hi, thanks for your interest!

As we show in Table 1 of the paper, it's very challenging to perfectly reverse engineer portraits. I think there are two main reasons: 1. it's not easy to learn hard prompts that represent human faces, which are complicated to describe; 2. Stable Diffusion is sometimes not good at generating portraits. Therefore, I would recommend using longer prompts by increasing args.prompt_len, and also using Midjourney for generation (you can get free access in their Discord channel).
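For reference, a minimal sketch of overriding the prompt length, assuming the repo's sample_config.json layout and the optimize_prompt helper (treat the exact file, field, and function names as assumptions):

```python
import argparse
import json

import open_clip
import torch
from PIL import Image

from optim_utils import optimize_prompt  # assumed entry point in this repo

# Load the config and override the prompt length; "sample_config.json", its field
# names, and optimize_prompt's signature are assumptions about the current code
# layout rather than guarantees.
with open("sample_config.json") as f:
    args = argparse.Namespace(**json.load(f))
args.prompt_len = 16  # longer prompts tend to capture more facial detail

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    args.clip_model, pretrained=args.clip_pretrain, device=device
)

target_images = [Image.open("portrait.jpg")]  # placeholder path to the target portrait
learned_prompt = optimize_prompt(
    model, preprocess, args, device, target_images=target_images
)
print(learned_prompt)
```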

I simply ran one shot on your image with args.prompt_len=16, and got the prompt:

daria muellerlittlemix neva ilio kirby heirloom nye blonstockheadshot haihairdresser inheritaswell vibrant

The image below is the generation from Midjourney, which is not perfect but way better than the previous result.

[Midjourney generation attached]

Alphyn-gunner commented 1 year ago

I find it both amazing and weird that the resulting prompts can even be used in other neural networks.

As far as I understand, the current implementation uses Stable Diffusion 2.1 as the diffusion model? What if I want to generate images with 1.5? Is it possible to use another model, and would it give better results?

YuxinWenRick commented 1 year ago

Actually, we don't use Stable Diffusion 2.1 directly. We use the CLIP model that Stable Diffusion 2.1 is built on (though Stable Diffusion 2.1 only uses its text encoder). So the prompts will be transferable to other models that use the same CLIP text encoder.

Also, since most words in the learned prompts are human-readable, they can plausibly transfer to other models as well.

If you want to generate images with 1.5, I'd recommend switching the CLIP model to the one used by 1.5. You can find more details here: #1.
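As a rough sketch, swapping in the CLIP text encoder that Stable Diffusion 1.x uses would look something like this (model/pretrained names are the well-known SD 1.x encoder, but how they plug into the config is an assumption):

```python
import open_clip
import torch

# Stable Diffusion 1.x uses OpenAI's CLIP ViT-L/14, while 2.x uses OpenCLIP
# ViT-H/14 trained on LAION-2B. Optimizing the prompt against this encoder
# should make it match 1.5's text encoder; in the repo's config this would
# presumably correspond to the "clip_model" / "clip_pretrain" fields.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai", device=device
)
```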

Alphyn-gunner commented 1 year ago

Oh, that makes sense. Thanks for the response and this amazing project. I'll try using different CLIP models.