Closed: sherrydoge closed this issue 4 months ago
Nice work! I tried your demo and got impressive results, but I'm confused about why the text cross-attention can stay frozen. Since you fuse image features into the text embeddings, the original text cross-attention should not be able to recognize them anymore. I wonder why training the face encoder alone is enough to handle this, and whether you have tried making the text cross-attention trainable?

Hi @sherrydoge, thank you for your careful observation. Cross-attention is indeed an important set of parameters; in fact, our model partially unfreezes the cross-attention during training. You can refer to the discussion in issue #41.

The current version of the paper still has some flaws. We plan to release an updated version, along with more extended features, after the paper is officially accepted. If you have any further ideas or questions, please feel free to open a PR or start a discussion~
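For readers wondering what "partially unfreezing the cross-attention" could look like in practice, below is a minimal sketch using a diffusers-style Stable Diffusion UNet. This is an illustrative assumption, not the repository's actual training code: the model ID is a placeholder, and the module-name suffixes `attn2.to_k` / `attn2.to_v` follow the standard diffusers UNet layout, where `attn2` is the text cross-attention block.

```python
# Minimal sketch (assumed, not the authors' training code): freeze the whole
# UNet, then unfreeze only the cross-attention key/value projections, which
# are the layers that consume the (text + fused image) conditioning embedding.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder base model
    subfolder="unet",
)

# Freeze everything first.
unet.requires_grad_(False)

# Partially unfreeze: only the cross-attention K/V projections are trained.
trainable_params = []
for name, module in unet.named_modules():
    if name.endswith("attn2.to_k") or name.endswith("attn2.to_v"):
        for p in module.parameters():
            p.requires_grad = True
            trainable_params.append(p)

optimizer = torch.optim.AdamW(trainable_params, lr=1e-5)
```

Training only the `to_k`/`to_v` projections is a common way to adapt cross-attention to a modified conditioning embedding while keeping the rest of the UNet frozen; which layers this repository actually unfreezes is described in issue #41.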