Closed zhangquanwei962 closed 1 year ago
We appreciate your interest in our work. In the code, the 15 FC layers are introduced in the form of a 5-layer transformer. Since the number of tokens is 1, one transformer layer is equivalent to 3 FC layers. Thus, a 5-layer transformer is equivalent to 15 FC layers.
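To illustrate the intuition above, here is a minimal NumPy sketch (not from the repo; all weight names are hypothetical) showing why single-head self-attention over a single token collapses into plain FC layers: the softmax over a 1x1 score matrix is exactly 1, so the attention output is just a chain of linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((1, d))  # a single token (sequence length 1)

# hypothetical projection weights, random for illustration
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) for _ in range(4))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# standard single-head self-attention
q, k, v = x @ W_q, x @ W_k, x @ W_v
attn = softmax(q @ k.T / np.sqrt(d)) @ v @ W_o

# with one token the 1x1 softmax is 1, so attention reduces to
# two chained linear (FC) maps: x -> W_v -> W_o
fc_equiv = x @ W_v @ W_o

assert np.allclose(attn, fc_equiv)
```

Together with the feed-forward sublayer that follows attention in each transformer block, this is why the per-layer computation on a length-1 sequence behaves like a small stack of FC layers.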
Is there a reason you did this? Is it because the original Stable Diffusion expects CLIP embedding inputs in this way? Or maybe because you're using learned embeddings of the reference images?
Thank you for your excellent work!
As above, I don't know where the 15 FC layers are in this code. Can you help me?