YuxinWenRick / hard-prompts-made-easy

MIT License
591 stars 54 forks

reproduce result that only uses soft prompt #27

Closed · zhixiongzh closed this issue 2 months ago

zhixiongzh commented 2 months ago

Hi,

In the paper it is claimed that "We note that even though Stable Diffusion and CLIP share the same text encoder, soft prompts do not transfer well compared to all hard prompt methods in our evaluation".

How can I reproduce this soft prompt result with your code? I guess I need to pass the soft prompt embedding directly to Stable Diffusion, but I am not sure how, since SD only supports hard prompts as input. Even where SD does accept a prompt embedding as input, the format of that embedding is different from the one you optimize.

Thanks in advance for any guidance.

YuxinWenRick commented 2 months ago

Hi, thanks for reaching out.

You can pass the soft prompt as prompt_embeds.

For the soft prompt optimization, you can simply skip the forward projection: https://github.com/YuxinWenRick/hard-prompts-made-easy/blob/main/optim_utils.py#L163-L164, i.e. set projected_embeds = prompt_embeds, and at the end return prompt_embeds as the final soft prompt.
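
For concreteness, here is a minimal sketch of what that soft-prompt variant looks like. It is my own simplification, not the repo's code: `encode_text_from_embeds` is a hypothetical stand-in for whatever helper maps continuous token embeddings to pooled CLIP text features in your setup, and the target image features are assumed to be precomputed.

```python
# Minimal sketch of soft-prompt optimization (my own illustration, not the repo's code):
# optimize continuous token embeddings against target CLIP image features, with no
# projection onto real vocabulary tokens.
import torch

def optimize_soft_prompt(encode_text_from_embeds, target_image_features,
                         token_dim, num_tokens=8, steps=1000, lr=0.1):
    # encode_text_from_embeds: maps [1, num_tokens, token_dim] embeddings to pooled
    # CLIP text features (hypothetical stand-in for the repo's text-encoding helper)
    prompt_embeds = torch.randn(1, num_tokens, token_dim, requires_grad=True)
    opt = torch.optim.AdamW([prompt_embeds], lr=lr)
    target = target_image_features / target_image_features.norm(dim=-1, keepdim=True)

    for _ in range(steps):
        text_features = encode_text_from_embeds(prompt_embeds)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        loss = 1.0 - (text_features * target).sum(dim=-1).mean()  # cosine distance
        opt.zero_grad()
        loss.backward()
        opt.step()

    # unlike the hard-prompt path, there is no nearest-neighbor projection or
    # decoding step here: the continuous embeddings themselves are the result
    return prompt_embeds.detach()
```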

Let me know if you have further questions!

zhixiongzh commented 2 months ago

@YuxinWenRick Thanks for the answer!

If this is how you transfer the soft prompt from CLIP to Stable Diffusion, then it is a bug, and your claim in the paper is not correctly verified.

The reason is that the prompt_embeds returned by your optimization code are embeddings without positional information and have not been processed by the text encoder (they are just rows of the lookup table). However, the prompt_embeds accepted by Stable Diffusion should be embeddings that include positional information and have been processed by the text encoder; in other words, they are prompt features rather than prompt embeddings (the name is misleading). Please refer to the example given by the official Hugging Face team. You can see that they use the text encoder to process the hard prompt rather than reading from the lookup table. If you print the prompt_embeds they create, you will find that the padding embeddings are not all identical, which shows they are not lookup-table embeddings but features produced by the text encoder model.
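
For reference, this is roughly the pattern that example shows (the checkpoint name and a diffusers version that supports prompt_embeds, >= 0.12, are my assumptions): prompt_embeds is the text encoder's output, not a row of the embedding table.

```python
# Hedged sketch of the prompt_embeds usage (not code from this repo):
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

text_inputs = pipe.tokenizer(
    "a photo of an astronaut riding a horse",
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    # [1, 77, 768]: contextual features with positional information baked in
    prompt_embeds = pipe.text_encoder(text_inputs.input_ids.to("cuda"))[0]

image = pipe(prompt_embeds=prompt_embeds).images[0]
```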

YuxinWenRick commented 2 months ago

Hi, I think you are right. You need to pass it to the text encoder first. That's a nice catch.
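
One way to do that without touching the encoder's internals is to register the optimized embeddings as placeholder tokens (textual-inversion style) and encode a prompt made of those tokens. A hedged sketch, assuming `pipe` is a loaded StableDiffusionPipeline and `soft_embeds` holds the optimized embeddings with shape `[num_tokens, dim]` in the same embedding space as the pipeline's text encoder:

```python
# Hedged sketch (not from this repo): inject learned soft embeddings as placeholder
# tokens so the text encoder adds positional information and contextualizes them.
import torch

placeholder_tokens = [f"<soft_{i}>" for i in range(soft_embeds.shape[0])]  # hypothetical names
pipe.tokenizer.add_tokens(placeholder_tokens)
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))

token_ids = pipe.tokenizer.convert_tokens_to_ids(placeholder_tokens)
with torch.no_grad():
    # soft_embeds must match the embedding table's dtype and device
    pipe.text_encoder.get_input_embeddings().weight[token_ids] = soft_embeds

text_inputs = pipe.tokenizer(
    " ".join(placeholder_tokens),
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
)
with torch.no_grad():
    prompt_embeds = pipe.text_encoder(text_inputs.input_ids)[0]
# prompt_embeds can then be passed to the pipeline as in the example above
```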

I double-checked the code I implemented for the soft prompt experiments. What we actually did was train a soft prompt and then project it to its nearest neighbors as a hard prompt. The diffusers version (v0.11.0) we were using didn't support prompt_embeds at the time: https://github.com/huggingface/diffusers/blob/v0.11.0/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L409-L424.
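
For completeness, that projection step looks roughly like the following (my own illustration; the repo's actual projection code may differ in details, and a Hugging Face CLIPTokenizer is assumed for decoding):

```python
# Rough illustration of nearest-neighbor projection (not the repo's exact code):
import torch
import torch.nn.functional as F

def project_to_hard_prompt(soft_embeds, token_embedding, tokenizer):
    # soft_embeds: [num_tokens, dim]; token_embedding: the text encoder's nn.Embedding
    vocab = F.normalize(token_embedding.weight, dim=-1)    # [vocab_size, dim]
    queries = F.normalize(soft_embeds, dim=-1)             # [num_tokens, dim]
    nn_indices = (queries @ vocab.T).argmax(dim=-1)        # cosine nearest neighbors
    return tokenizer.decode(nn_indices.tolist())
```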

Meanwhile, we mentioned this in our paper (screenshot of the relevant passage attached), as we focus on discrete prompt optimization.

However, I am curious whether real soft prompts work. Have you tried it using the method you mentioned?

zhixiongzh commented 1 month ago

Thanks for the careful double-check!

I did try the method, but I found that the prompt embedding Stable Diffusion accepts is still different from the one produced by the method I mentioned, and I cannot find the reason yet. I will leave it as a bug for now and move on to other things. Anyway, thanks for the clarification!