TencentQQGYLab / ELLA

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
https://ella-diffusion.github.io/
Apache License 2.0
1.04k stars 54 forks

Details about EMMA #47

Open CharlesGong12 opened 1 month ago

CharlesGong12 commented 1 month ago

Such excellent work! I am reading the EMMA paper and noticed that you didn't give details about the face encoder and image encoder. Which encoders do you use? Also, could you explain your motivation for using the gate network? Looking forward to your reply. Thanks!

budui commented 1 month ago
  1. Face encoder: AdaFace (because it uses the MIT license); person image encoder: CLIP H/14 (same as IP-Adapter).
  2. Because I found that the 64 tokens within the Resampler have strong semantic biases: some are exclusively related to the foreground, while others relate solely to the background. The gating layer helps EMMA leverage these biases.
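The thread does not spell out the gate network's architecture, but the idea described above can be sketched as a per-token scalar gate: each of the 64 Resampler tokens predicts its own gate value from its features, so tokens biased toward the foreground or background can be amplified or suppressed per condition. This is a minimal NumPy sketch under that assumption; the class name `TokenGate`, the single linear gate projection, and all shapes are illustrative, not EMMA's actual implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TokenGate:
    """Hypothetical per-token gating layer (illustrative, not EMMA's code).

    Each of the `num_tokens` Resampler tokens produces a scalar gate in
    (0, 1) from its own feature vector; the token is then scaled by that
    gate, letting semantically biased tokens be emphasized or suppressed.
    """

    def __init__(self, num_tokens=64, dim=768, seed=0):
        rng = np.random.default_rng(seed)
        # Gate projection: one shared linear map to a scalar per token
        # (assumed form; a real gate network may be an MLP).
        self.w = rng.standard_normal(dim) * 0.02
        self.b = np.zeros(num_tokens)

    def __call__(self, tokens):
        # tokens: (num_tokens, dim) -> gated tokens of the same shape
        gates = sigmoid(tokens @ self.w + self.b)  # (num_tokens,)
        return tokens * gates[:, None]


if __name__ == "__main__":
    gate = TokenGate(num_tokens=64, dim=32)
    tokens = np.ones((64, 32))
    gated = gate(tokens)
    print(gated.shape)  # each token scaled by its own gate in (0, 1)
```

Because every gate lies in (0, 1), the layer can only attenuate tokens here; a learned gate trained end-to-end would decide which foreground- or background-biased tokens to pass through for a given condition.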