ChenyangQiQi / FateZero

[ICCV 2023 Oral] "FateZero: Fusing Attentions for Zero-shot Text-based Video Editing"
http://fate-zero-edit.github.io/
MIT License

Where is the Spatial-Temporal Self-Attention? #13

Closed StellarCheng closed 1 year ago

StellarCheng commented 1 year ago

Thanks for your work! I wonder where you implemented the Spatial-Temporal Self-Attention. In the appendix, Algorithms 1 and 2 only show attention fusing. I am also puzzled about whether attention fusing would conflict with Spatial-Temporal Self-Attention: if attention fusing is performed after Spatial-Temporal Self-Attention, it seems equivalent to not having performed Spatial-Temporal Self-Attention at all.

ChenyangQiQi commented 1 year ago

There are no conflicts, because the techniques of attention fusion and network architecture are essentially orthogonal. During inversion, we also inflate the original 2D self-attention into spatial-temporal self-attention. We then store the attention maps of the spatial-temporal self-attention at each denoising step. Finally, we fuse the stored attention maps during editing. You may check our implementation here, which is called during both inversion and denoising.
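
For anyone landing here later, below is a minimal sketch of what "inflating" 2D self-attention into spatial-temporal self-attention can look like when a 2D UNet folds frames into the batch dimension. This is illustrative only, not the repo's actual code: the function name and shapes are assumptions, and FateZero's real implementation restricts which frames the keys/values come from rather than attending over all frames as done here.

```python
import torch
from einops import rearrange

def spatial_temporal_self_attention(q, k, v, video_length):
    """Toy spatial-temporal self-attention via key/value inflation.

    q, k, v: (batch * frames, tokens, dim), the shape a 2D UNet produces
    when video frames are folded into the batch dimension.
    Queries stay per-frame; keys/values are gathered across frames so
    each spatial token can attend over space and time.
    """
    # Unfold frames out of the batch and concatenate key/value tokens over time.
    k = rearrange(k, "(b f) n d -> b (f n) d", f=video_length)
    v = rearrange(v, "(b f) n d -> b (f n) d", f=video_length)
    # Broadcast the temporal key/value bank back to every frame's queries.
    k = k.repeat_interleave(video_length, dim=0)  # (b*f, f*n, d)
    v = v.repeat_interleave(video_length, dim=0)  # (b*f, f*n, d)

    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    # `attn` is the kind of map that gets stored per denoising step during
    # inversion and fused back in during editing.
    return attn @ v, attn

# Usage with illustrative shapes:
b, f, n, d = 1, 8, 64, 320
q = k = v = torch.randn(b * f, n, d)
out, attn = spatial_temporal_self_attention(q, k, v, video_length=f)
# out: (8, 64, 320); attn: (8, 64, 512) -- each token now sees all 8 frames
```

Since the attention maps are produced by the already-inflated attention, fusing them afterwards does not undo the inflation, which is why the two techniques compose.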

StellarCheng commented 1 year ago

OK, that makes sense! Thanks for your quick reply!