HumanAIGC / AnimateAnyone

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Apache License 2.0

14.23k stars, 952 forks

Why use the information from the Reference Image to encode Spatial-Attention and Cross-Attention into the Denoising UNet's Pose Sequence method instead of the other way around? #47

Open: hxypqr opened this issue 8 months ago

hxypqr commented 8 months ago

Is there a mathematical explanation for why injecting the Reference Image information, through Spatial-Attention and Cross-Attention, into the Denoising UNet that processes the Pose Sequence is more advantageous than the reverse direction?
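
To make the direction in question concrete, here is a minimal NumPy sketch of reference features being merged into the denoising stream through spatial attention. It is an illustration under stated assumptions, not the repository's code. It assumes a single frame, flattened tokens, and identity Q/K/V projections. It also assumes the merge works as I read the paper: concatenate both feature maps, run self-attention, and keep only the half that belongs to the denoising stream. The names `reference_spatial_attention`, `denoise_tokens`, and `reference_tokens` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Assumption: identity Q/K/V projections to keep the sketch short.
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

def reference_spatial_attention(denoise_tokens, reference_tokens):
    # Hypothetical helper: concatenate both streams along the token axis,
    # attend over the joint sequence, then keep only the denoising half.
    n = denoise_tokens.shape[0]
    joint = np.concatenate([denoise_tokens, reference_tokens], axis=0)
    return self_attention(joint)[:n]

rng = np.random.default_rng(0)
denoise_tokens = rng.normal(size=(6, 4))    # pose-conditioned noisy latent tokens
reference_tokens = rng.normal(size=(6, 4))  # reference image feature tokens

out = reference_spatial_attention(denoise_tokens, reference_tokens)
print(out.shape)
```

In this sketch the output keeps the shape of the denoising stream, and the reference tokens only contribute as extra keys and values. Discarding the second half makes the operation equivalent to attention whose queries come from the denoising tokens alone. The reverse direction would swap these roles, which is the asymmetry the question is about.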