Closed · EveningLin closed this 8 months ago
This practice follows T2I-Adapter [arXiv:2302.08453] to efficiently inject structural information without fine-tuning the entire U-Net, and we did not claim it as one of our contributions.
Rather, we were inspired by it and derived the Hybrid-Granularity Attention (HGA) module to decouple the control of fine-grained appearance and pose information, thereby circumventing the potential overfitting problem. To clarify, we also post here the explanation of our motivation and the overfitting issue from the rebuttal phase:
Our observation is that previous diffusion-based methods fit the spatial convolutional features of the source image directly into the noisy sample. This does not make sense in practice, because the texture details of the source image should generally not appear at the same positions in the target sample, especially under exaggerated pose transitions. Since the model is effectively performing copy-and-paste, the generations are distorted and unnatural; we refer to this phenomenon as overfitting and a lack of generalization ability.
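For concreteness, below is a minimal, non-authoritative sketch of the T2I-Adapter-style injection mentioned above: a small adapter encodes the pose map into multi-scale features that are added residually to the encoder activations of a frozen U-Net, instead of pasting source-image features into the noisy sample. The names (`PoseAdapter`, `unet_encoder_blocks`), channel sizes, and strides are illustrative assumptions, not our released code.

```python
import torch.nn as nn

class PoseAdapter(nn.Module):
    """Tiny convolutional encoder producing one feature map per U-Net encoder scale.

    Channel sizes and strides are illustrative and must be matched
    to the actual U-Net being conditioned.
    """
    def __init__(self, in_ch=3, channels=(320, 640, 1280, 1280)):
        super().__init__()
        self.blocks = nn.ModuleList()
        prev = in_ch
        for ch in channels:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            ))
            prev = ch

    def forward(self, pose_map):
        feats, x = [], pose_map
        for block in self.blocks:
            x = block(x)
            feats.append(x)  # coarse-to-fine structural features
        return feats

# Usage sketch: only the adapter receives gradients; the U-Net stays frozen,
# and pose features are injected as residuals on the encoder activations.
# `unet_encoder_blocks` is a hypothetical handle to the frozen encoder stages.
#
# adapter = PoseAdapter()
# for p in unet.parameters():
#     p.requires_grad_(False)
#
# pose_feats = adapter(pose_map)
# h = noisy_latent
# for i, enc_block in enumerate(unet_encoder_blocks):
#     h = enc_block(h, t_emb)
#     if i < len(pose_feats):
#         h = h + pose_feats[i]  # add structure without touching U-Net weights
```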
Thanks for your attention to our paper. If you find our work helpful, please consider giving us a star 🌟. If you have more questions, feel free to reply below this issue.
Why did you choose to use the pose in the first half of the stages rather than in the second half?