Picsart-AI-Research / Text2Video-Zero

[ICCV 2023 Oral] Text-to-Image Diffusion Models are Zero-Shot Video Generators
https://text2video-zero.github.io/

Questions about cross frame attention #74

Closed jingwu2121 closed 5 months ago

jingwu2121 commented 5 months ago

Hi there, thank you for your great work!

I am using the depth ControlNet version and have a few questions about cross-frame attention. According to the paper, cross-frame attention fixes K and V to the first frame of the chunk and varies only Q across frames. But in the code, K and V appear to be generated from the encoder hidden states, which come from the prompt. We only have one prompt at the beginning, which is then copied several times, so the different slices of the encoder hidden states should be identical, shouldn't they? If so, there seems to be no difference between cross-frame attention and self-attention. I am quite new to this field. Could you help me with this?
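To make the distinction in the question concrete, here is a minimal pure-Python sketch of the mechanism as the paper describes it. It is not the repository's code. It assumes that in the self-attention layers K and V are projected from each frame's latent features rather than from the prompt embedding (in the diffusers attention-processor convention, `encoder_hidden_states` is `None` there and falls back to `hidden_states`). All names, shapes, and the random weights below are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # a is (n x k), b is (k x m); returns (n x m) as nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention(q, k, v):
    # Scaled dot-product attention for a single frame.
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


def self_attention(frames, w_q, w_k, w_v):
    # Each frame builds Q, K and V from its own latent features.
    return [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)) for x in frames]


def cross_frame_attention(frames, w_q, w_k, w_v):
    # Q still comes from each frame, but K and V are taken from frame 0.
    k0 = matmul(frames[0], w_k)
    v0 = matmul(frames[0], w_v)
    return [attention(matmul(x, w_q), k0, v0) for x in frames]


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical sizes: 3 frames, 4 latent tokens per frame, 8 channels.
num_frames, tokens, dim = 3, 4, 8
frames = [rand_matrix(tokens, dim) for _ in range(num_frames)]
w_q, w_k, w_v = (rand_matrix(dim, dim) for _ in range(3))

self_out = self_attention(frames, w_q, w_k, w_v)
cross_out = cross_frame_attention(frames, w_q, w_k, w_v)

# Per-frame difference between the two variants.
diffs = [max_abs_diff(s, c) for s, c in zip(self_out, cross_out)]
print(diffs)
```

Under this assumption the two variants agree on the first frame, whose K and V are its own, and differ on every later frame, because those frames' latent features are not identical even though the prompt embedding is copied for each of them.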