YanzuoLu / CFLD

[CVPR 2024 Highlight] Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis
MIT License
183 stars 12 forks

Some questions about pose #2

Closed EveningLin closed 8 months ago

EveningLin commented 8 months ago

[image] Why did you choose to use the pose in the first half of the stages rather than in the second half?

YanzuoLu commented 8 months ago

This practice follows T2I-Adapter [arXiv:2302.08453], which injects structural information efficiently without fine-tuning the entire U-Net. We did not claim it as one of our contributions.
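To make the injection scheme concrete, here is a minimal pure-Python sketch of T2I-Adapter-style conditioning: at each encoder scale, a pose feature of matching size is added to the U-Net feature. It assumes that "first half" in the question means the encoder (downsampling) path. The names `downsample` and `encode_with_adapter` and the toy 1-D "features" are hypothetical stand-ins, not code from this repository.

```python
def downsample(x):
    # Stand-in for one encoder block: average neighbouring pairs,
    # which halves the resolution (hypothetical toy operation).
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def encode_with_adapter(x, pose_feats):
    """Run the toy encoder, adding one pose feature per scale."""
    skips = []
    for feat in pose_feats:
        x = downsample(x)
        assert len(x) == len(feat), "pose feature must match this scale"
        # T2I-Adapter-style injection: element-wise addition at this scale.
        x = [a + b for a, b in zip(x, feat)]
        skips.append(x)  # skip connections carry the pose signal onward
    return x, skips


latent = [1.0, 3.0, 5.0, 7.0]         # toy noisy latent
pose_feats = [[0.5, -0.5], [1.0]]     # toy multi-scale pose features
out, skips = encode_with_adapter(latent, pose_feats)
print(out, skips)
```

Because the encoder outputs also feed the decoder through skip connections, a signal injected in the encoder can still influence the later stages without a separate injection there.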

In fact, we were inspired by T2I-Adapter and derived the Hybrid-Granularity Attention (HGA) module to decouple the controls for fine-grained appearance and pose information, thus circumventing the potential overfitting problem. To clarify, here is the explanation of our motivation and of overfitting that we gave during the rebuttal phase:

Our observation is that previous diffusion-based methods fit the spatial convolutional features of the source image directly into the noisy sample. This does not make sense in practice, because the texture details of the source image usually should not appear at the same position in the target sample, especially when the pose transition is exaggerated. Since the model is effectively performing copy-and-paste, the generations are distorted and unnatural. We refer to this phenomenon as overfitting and a lack of generalization ability.
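The copy-and-paste problem can be illustrated with a toy example. This is only a sketch under stated assumptions, not the HGA module itself: `aligned_fusion` and `cross_attention` are hypothetical helpers, body parts are encoded as one-hot vectors, and textures are single numbers. Position-aligned fusion leaves each texture where it was in the source, while an attention-style lookup lets each target position retrieve the texture of the matching part wherever it sat in the source.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aligned_fusion(source_values):
    # Target position i simply receives source position i (copy-and-paste).
    return list(source_values)


def cross_attention(queries, keys, values, scale=10.0):
    # Each target query attends over all source positions.
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append(sum(w * v for w, v in zip(weights, values)))
    return out


# Source image: head, torso, legs at positions 0, 1, 2 (one-hot part codes).
keys = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
values = [1.0, 2.0, 3.0]  # toy texture codes for head, torso, legs

# Target pose: the same parts appear in reversed order.
queries = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]

pasted = aligned_fusion(values)
attended = cross_attention(queries, keys, values)
print(pasted)
print([round(v, 2) for v in attended])
```

In this toy setting the aligned result keeps the source layout, whereas the attended result follows the target layout.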

Thanks for your attention to our paper. If you find our work helpful, please consider giving us a star🌟. If you have more questions, feel free to reply in this issue.