Open Pomelover opened 1 week ago
Because the pretrained Q-Former is designed to extract image features that are consistent with the text, we feed it the prompt "style" or "content" so that it generates decoupled representations. This is a major contribution of the work: it decouples the content and style of the reference image from the perspective of feature extraction.
Thank you for sharing this work. I have a question about the paper. Why do you use the Q-Former to receive the prompt ("Style" or "Content")? Would it be possible to give the prompt to the U-Net and fine-tune it instead?