bytedance / DEADiff

[CVPR 2024] Official implementation of "DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations"
Apache License 2.0

Question on the "content representation extraction task" of the Q-Former #11

Open · WonwoongCho opened this issue 1 month ago

WonwoongCho commented 1 month ago

Hi authors, thank you for sharing the awesome work.

As far as I understand, only the style representation from the Q-Former is used during inference. If that is correct, why is the content training needed? Does it help the Q-Former learn a better disentangled representation for "style"?

I've probably missed some part of the paper. I'd appreciate it if somebody could let me know. Thanks!

Tianhao-Qi commented 1 month ago

The goal of the dual content/style training is to help the model better distinguish between the style and the semantics of the reference image. It reduces the leakage of the reference image's semantics into the style representation, which leads to better text alignment, as shown in Table 2 of our paper.
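
For intuition, here is a minimal, hypothetical sketch (not the authors' code) of the general idea: a single Q-Former whose learnable queries are steered by a "style" or "content" task token, so both tasks share weights but produce disentangled representations. All module and variable names (`QFormerStub`, `task_embed`, etc.) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class QFormerStub(nn.Module):
    """Toy stand-in for a task-conditioned Q-Former (hypothetical)."""

    def __init__(self, dim=768, num_queries=16, num_layers=2):
        super().__init__()
        # Learnable queries shared by both the style and content tasks.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, image_feats, task_embed):
        # Prepend the task token ("style" or "content") so the same
        # weights yield task-specific, disentangled outputs.
        b = image_feats.size(0)
        q = torch.cat([task_embed, self.queries.expand(b, -1, -1)], dim=1)
        return self.decoder(q, image_feats)

qformer = QFormerStub()
img_feats = torch.randn(2, 257, 768)    # e.g. CLIP image patch features
style_tok = torch.randn(2, 1, 768)      # embedding of the word "style"
content_tok = torch.randn(2, 1, 768)    # embedding of the word "content"

# At inference only the style branch is queried; the content branch is
# trained purely so that semantics get pushed out of the style queries.
style_repr = qformer(img_feats, style_tok)
content_repr = qformer(img_feats, content_tok)
```

Under this reading, training pairs the style task with a target sharing the reference's style (but not its content), and the content task with a target sharing its content (but not its style), so each set of queries can only succeed by ignoring the other factor.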