There is no conflict, because the attention-fusing technique and the network architecture are essentially orthogonal. During inversion, we also inflate the original 2D self-attention into spatial-temporal self-attention. We then store the attention maps of the spatial-temporal self-attention at each denoising step. Finally, we fuse the stored attention maps during editing. You may check our implementation here, which is called during both inversion and denoising.
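To make the ordering concrete, below is a minimal sketch (not the repository's actual code) of how the two pieces can coexist: the self-attention is first inflated so tokens attend across frames, its maps are cached during inversion, and the cached maps are fused with the current maps during editing. Class and argument names such as `SpatialTemporalSelfAttention`, `AttentionStore`, `mode`, and `fuse_mask` are hypothetical illustrations, not the project's API.

```python
# Hedged sketch: spatial-temporal self-attention with attention-map storing/fusing.
# All names here are illustrative assumptions, not the repository's real code.
import torch


class AttentionStore:
    """Caches attention maps during inversion so they can be fused while editing."""

    def __init__(self):
        self.maps = {}  # (step, layer) -> attention map

    def save(self, step, layer, attn):
        self.maps[(step, layer)] = attn.detach()

    def load(self, step, layer):
        return self.maps[(step, layer)]


class SpatialTemporalSelfAttention(torch.nn.Module):
    """2D self-attention inflated so every token attends across all frames."""

    def __init__(self, dim, store):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(dim, dim)
        self.to_v = torch.nn.Linear(dim, dim)
        self.store = store

    def forward(self, x, step, layer, mode="invert", fuse_mask=None):
        # x: (batch, frames, tokens, dim) -> flatten frames into the token axis,
        # so attention spans space and time jointly
        b, f, n, d = x.shape
        x = x.reshape(b, f * n, d)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)

        if mode == "invert":
            # inversion pass: store the spatial-temporal attention map at this step
            self.store.save(step, layer, attn)
        else:
            # editing pass: fuse the stored (source) map with the current (edited) map,
            # e.g. keep the source attention outside the edited region
            src_attn = self.store.load(step, layer)
            if fuse_mask is not None:
                attn = torch.where(fuse_mask, attn, src_attn)
            else:
                attn = src_attn

        out = attn @ v
        return out.reshape(b, f, n, d)
```

The key point the sketch illustrates is that fusing operates on maps that are themselves spatial-temporal, since they were produced and stored by the inflated attention during inversion, so fusing does not undo the inflation.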
Ok, that makes sense! Thanks for your quick reply!
Thanks for your work! I wonder where you implemented the Spatial-Temporal Self-Attention; in Algorithms 1 and 2 in the appendix, only attention fusing is shown. I am also puzzled about whether attention fusing would conflict with Spatial-Temporal Self-Attention: if attention fusing is performed after Spatial-Temporal Self-Attention, it seems equivalent to not having performed Spatial-Temporal Self-Attention at all.