Fantasy-Studio / Paint-by-Example

Paint by Example: Exemplar-based Image Editing with Diffusion Models
https://arxiv.org/abs/2211.13227

Transformer mapper #14

Closed Teoge closed 1 year ago

Teoge commented 1 year ago

Hi, thanks for your excellent work.

I noticed that you use a transformer to map CLIP image embeddings to Stable Diffusion conditions: https://github.com/Fantasy-Studio/Paint-by-Example/blob/main/ldm/modules/encoders/modules.py#L144-L149 But the sequence dimension of the CLIP image embedding is 1, which means the attention in the transformer does nothing. I find this a little confusing. Is there a particular reason for it?

Thanks.

Fantasy-Studio commented 1 year ago

We appreciate your interest in our work. When the number of tokens is 1, a transformer block is effectively equivalent to 3 FC layers, since attention over a single token is trivial; this is why the paper describes the mapper as several FC layers rather than transformer blocks. In the beginning we explored using all 257 tokens of the CLIP image embedding, and therefore implemented the mapper with transformer blocks to decode them. Because the two are equivalent for a single token, we did not replace the transformer with FC layers afterwards, which keeps the ablation study against the 257-token variant cleaner.
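
For anyone else wondering why the two are equivalent: with a sequence length of 1, the softmax in self-attention is taken over a single position, so the attention weights are always 1 and each token only passes through the value and output projections, i.e. a stack of per-token linear (FC) layers. A minimal PyTorch sketch of this collapse (not the repository's code; the dimension and module choice here are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dim = 768
x = torch.randn(1, 1, dim)  # (batch, seq_len=1, dim): a single image token

# Multi-head self-attention over a length-1 sequence.
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
attn_out, attn_weights = attn(x, x, x)

print(attn_weights)  # all ones: softmax over one position is trivial

# Reproduce the same output using only linear layers from the module:
# the value projection followed by the output projection.
qkv = nn.functional.linear(x, attn.in_proj_weight, attn.in_proj_bias)
_, _, v = qkv.chunk(3, dim=-1)   # keep only the value projection
linear_out = attn.out_proj(v)

print(torch.allclose(attn_out, linear_out, atol=1e-6))  # True
```

So for one token, the attention sub-layer reduces to two linear maps, and together with the block's MLP the whole thing behaves like a few stacked FC layers, as stated above.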