Could you explain why you use Q-Former to receive the prompt ("Style" or "Content")?

bytedance / DEADiff

[CVPR 2024] Official implementation of "DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations"

Apache License 2.0

190 stars, 4 forks

Could you explain why you use Q-Former to receive the prompt ("Style" or "Content")? #13

Open. Pomelover opened 1 week ago

Pomelover commented 1 week ago

Thank you for sharing this work. I have a question about the paper: why do you use the Q-Former to receive the prompt ("Style" or "Content")? Would it be possible to give the prompt to the U-Net directly and fine-tune it instead?

Tianhao-Qi commented 1 week ago

The pretrained Q-Former is designed to extract image features that are consistent with the text it receives. We therefore feed "style" or "content" to the Q-Former as its text input so that it produces decoupled representations of the reference image. Decoupling the content and style of the reference image at the feature-extraction stage is a major contribution of this work.
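
To make the idea concrete, here is a rough, dependency-free toy sketch of text-conditioned query extraction. It is not the DEADiff code or the real Q-Former: `toy_qformer`, the single attention head, the random vectors, and all sizes are assumptions chosen only for illustration. It shows how the same image features can yield different query outputs depending on whether the text input is "style" or "content".

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


def add(xs, ys):
    """Element-wise residual sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def toy_qformer(image_feats, queries, text_embeds):
    """Hypothetical, heavily simplified stand-in for a Q-Former block."""
    # 1) Self-attention over [queries; text tokens]: the text conditions the queries.
    joint = queries + text_embeds  # list concatenation
    conditioned = add(queries, attend(queries, joint, joint))
    # 2) Cross-attention: the conditioned queries read the image features.
    return add(conditioned, attend(conditioned, image_feats, image_feats))


def rand_vecs(rng, n, dim):
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
dim, n_queries, n_patches = 8, 4, 16  # illustrative sizes only

queries = rand_vecs(rng, n_queries, dim)        # stands in for learned queries
image_feats = rand_vecs(rng, n_patches, dim)    # stands in for image encoder output
text_table = {                                  # stands in for text token embeddings
    "style": rand_vecs(rng, 1, dim),
    "content": rand_vecs(rng, 1, dim),
}

# Same reference image, two different text conditions.
style_repr = toy_qformer(image_feats, queries, text_table["style"])
content_repr = toy_qformer(image_feats, queries, text_table["content"])
```

In this toy setup the only difference between `style_repr` and `content_repr` is the text token mixed into the queries before they attend to the image, which is the sense in which the extracted features depend on the prompt. Whether the two outputs are actually disentangled depends on training, which the sketch does not model.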