Why is the Reference Image information encoded into the Pose Sequence's Denoising UNet via Spatial-Attention and Cross-Attention, rather than the other way around?

HumanAIGC / AnimateAnyone

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Apache License 2.0

14.23k stars, 952 forks

Open: hxypqr opened 8 months ago

hxypqr commented 8 months ago

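Is there a mathematical explanation for why encoding the Reference Image information into the Pose Sequence's Denoising UNet via Spatial-Attention and Cross-Attention works better than doing it the other way around?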

fenghe12 commented 5 months ago

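I don't think there is any mathematical principle behind it. It simply works better, and for now there is no good way to explain that mathematically.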