Closed zhangquanwei962 closed 1 year ago
We appreciate your interest in our work. In the code, the 15 FC layers are introduced in the form of a 5-layer transformer. Since the number of tokens is 1, one transformer layer is equivalent to 3 FC layers. Thus, a 5-layer transformer is equivalent to 15 FC layers.
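To illustrate the intuition above, here is a minimal NumPy sketch (not from the repo; all weight names are hypothetical) showing why single-head self-attention over a single token collapses into plain FC layers: the softmax over a 1x1 score matrix is exactly 1, so the attention output is just a chain of linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((1, d))  # a single token (sequence length 1)

# hypothetical projection weights, random for illustration
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) for _ in range(4))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# standard single-head self-attention
q, k, v = x @ W_q, x @ W_k, x @ W_v
attn = softmax(q @ k.T / np.sqrt(d)) @ v @ W_o

# with one token the 1x1 softmax is 1, so attention reduces to
# two chained linear (FC) maps: x -> W_v -> W_o
fc_equiv = x @ W_v @ W_o

assert np.allclose(attn, fc_equiv)
```

Together with the feed-forward sublayer that follows attention in each transformer block, this is why the per-layer computation on a length-1 sequence behaves like a small stack of FC layers.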
Is there a reason you did this? Is it because the original Stable Diffusion expects CLIP embedding inputs in this way? Or maybe because you're using learned embeddings of the reference images?
Thank you for your excellent work!
As above, I don't know where the 15 FC layers are in this code. Can you help me?