markojak closed this issue 1 year ago
Hi, thanks for your interest!
As we show in Table 1 of the paper, it's very challenging to perfectly reverse engineer portraits. I think there are two main reasons: 1. Human faces are very complicated to describe, so it's not easy to learn hard prompts that represent them. 2. Stable Diffusion is sometimes not good at generating portraits. Therefore, I would recommend using longer prompts by changing args.prompt_len, and also using Midjourney for the generation (you can get free access in their Discord channel).
I simply ran one shot on your image with args.prompt_len=16 and got the prompt:
daria muellerlittlemix neva ilio kirby heirloom nye blonstockheadshot haihairdresser inheritaswell vibrant
The image below is the generation from Midjourney, which is not perfect but way better than the previous result.
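A minimal sketch of the longer-prompt setting mentioned above. Only args.prompt_len and the value 16 come from this thread; the default value, the num_steps key, and the make_args helper are hypothetical placeholders, and the project may load its arguments differently.

```python
import argparse

# Hypothetical defaults for illustration only.
# `prompt_len` is the one name taken from this thread.
default_config = {
    "prompt_len": 8,    # assumed shorter default, not confirmed here
    "num_steps": 1000,  # hypothetical key: number of optimization steps
}


def make_args(overrides=None):
    """Build an argparse.Namespace from the defaults plus optional overrides."""
    config = dict(default_config)
    config.update(overrides or {})
    return argparse.Namespace(**config)


# Ask for a 16-token prompt, as in the one-shot run described above.
args = make_args({"prompt_len": 16})
print(args.prompt_len)
```

A longer prompt gives the optimizer more tokens to describe the face, at the cost of a less readable result.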
I find it both amazing and weird that the resulting prompts can even be used in other neural networks.
As far as I understand, the current implementation uses Stable Diffusion 2.1 as the diffusion model? What if I want to generate images with 1.5? Is it possible to use another model, and would it give better results?
Actually, we are not directly using Stable Diffusion 2.1. We use the CLIP model that Stable Diffusion 2.1 uses (though Stable Diffusion 2.1 itself only uses its text encoder). So the prompts will be transferable to other models that use the same CLIP text encoder.
Meanwhile, most words in the learned prompts are human-readable, so they may also transfer to models that use a different text encoder.
If you want to generate images with 1.5, I'd recommend you switch the CLIP model to the one used by 1.5. You can find more details here: #1.
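As a rough sketch of what switching the CLIP model could look like: Stable Diffusion 1.5 takes its text encoder from OpenAI's CLIP ViT-L/14, while 2.1 uses the OpenCLIP ViT-H/14 model. The names below follow the open_clip naming scheme. The helper functions are hypothetical, and where this project reads the model name from is not stated in this thread (see #1 for that).

```python
# Stable Diffusion version -> (open_clip architecture name, pretrained tag).
CLIP_FOR_SD = {
    "1.5": ("ViT-L-14", "openai"),
    "2.1": ("ViT-H-14", "laion2b_s32b_b79k"),
}


def clip_spec(sd_version):
    """Return the (architecture, pretrained tag) pair for a given SD version."""
    try:
        return CLIP_FOR_SD[sd_version]
    except KeyError:
        raise ValueError(f"no CLIP mapping known for SD version {sd_version!r}")


def load_clip(sd_version, device="cpu"):
    """Load the matching CLIP model; downloads weights on first use."""
    import open_clip  # imported lazily so the lookup works without it installed

    name, pretrained = clip_spec(sd_version)
    # create_model_and_transforms returns (model, train_transform, val_transform).
    model, _, preprocess = open_clip.create_model_and_transforms(
        name, pretrained=pretrained, device=device
    )
    return model, preprocess


print(clip_spec("1.5"))
```

Optimizing against the same CLIP model that the target generator uses is what makes the learned tokens most likely to carry over.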
Oh, that makes sense. Thanks for the response and this amazing project. I'll try using different CLIP models.
This is such an awesome project. Thanks for building this. I'm trying to figure out how I would go about reverse engineering an intricate photorealistic portrait like this image.
If I run this currently, I get:
best cosine sim: 0.4274442791938782
best prompt: beatrice wolfdgers haircreative oirswolivanka
The images it outputs are here: https://share.cleanshot.com/GNsS4hJ9
You mentioned additional steps to figure out the optimal prompt. I don't mind training further if it can reveal counter-intuitive keywords that produce the output we'd like to get.
Thoughts?
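For context on the "best cosine sim" value quoted above: assuming it is the cosine similarity between the CLIP embedding of the learned prompt and that of the target image (this thread does not spell that out), it could be computed as in the sketch below. The toy vectors are made up for illustration; real CLIP embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 2-D stand-ins for a text embedding and an image embedding.
text_embedding = [1.0, 0.0]
image_embedding = [1.0, 1.0]
score = cosine_similarity(text_embedding, image_embedding)
print(round(score, 4))
```

A value of 1.0 would mean the two embeddings point in exactly the same direction, so a higher score suggests a prompt that CLIP considers closer to the image.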