gaoyixuan111 closed this 4 months ago
Hi @gaoyixuan111, this is groundwork for the multi-ID version we plan to release later. The trigger word </image/> is there so that users can supply multiple images. The idea follows PhotoMaker, which you can refer to.
Once our paper is officially accepted, we will update the paper and release more of the features it mentions. If you have any questions, feel free to ask or open a PR.
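To make the role of the trigger word concrete, here is a minimal sketch of how a placeholder token can mark where each input image's embedding enters the prompt sequence. This is an illustration only, not the repository's or PhotoMaker's actual code: the function name `merge_image_embeddings`, the plain-list embeddings, and the choice to simply replace the embedding at each trigger position (rather than fuse it with neighboring text tokens) are all assumptions made for the example.

```python
from typing import List, Sequence

# Trigger word as written in this thread; the rest of this sketch is hypothetical.
TRIGGER = "</image/>"


def merge_image_embeddings(
    tokens: Sequence[str],
    text_embeds: Sequence[List[float]],
    image_embeds: Sequence[List[float]],
    trigger: str = TRIGGER,
) -> List[List[float]]:
    """Replace the embedding at each trigger position with the next image embedding.

    One trigger token is expected per input image, in the same order as the images.
    """
    if len(tokens) != len(text_embeds):
        raise ValueError("tokens and text_embeds must have the same length")

    # Positions in the prompt where an image embedding should be injected.
    positions = [i for i, tok in enumerate(tokens) if tok == trigger]
    if len(positions) != len(image_embeds):
        raise ValueError(
            f"found {len(positions)} trigger tokens but got {len(image_embeds)} images"
        )

    # Copy the text embeddings so the caller's data is left untouched.
    merged = [list(vec) for vec in text_embeds]
    for pos, img in zip(positions, image_embeds):
        merged[pos] = list(img)
    return merged


# Usage: two trigger tokens in the prompt, two images (toy 2-dimensional embeddings).
tokens = ["a", "photo", "of", TRIGGER, "and", TRIGGER]
text_embeds = [[0.0, 0.0] for _ in tokens]
image_embeds = [[1.0, 1.0], [2.0, 2.0]]
merged = merge_image_embeddings(tokens, text_embeds, image_embeds)
```

With one trigger token per image, the number of images a prompt can reference is set by the prompt itself, which is why a placeholder of this kind is convenient for a multi-ID setting.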
@JackAILab Why freeze the text cross-attention? Since image features are integrated into text embeddings, the original text cross-attention cannot recognize them. Why is training the facial encoder sufficient to solve this problem? Have you tried setting the text cross-attention to be trainable?
@JackAILab Why is the trigger word </image/> set in the face encoder? What is the purpose of </image/>? Is it necessary?