Hi there, thank you for your great work!
I am using the depth ControlNet version and have a few questions about cross-frame attention. According to the paper, cross-frame attention fixes K and V to the first frame of the chunk and iterates Q. But according to the code, K and V are generated from the encoder hidden states, which come from the prompt. Since there is only one prompt at the beginning, which is then copied several times, the different slices of the encoder hidden states are all identical, aren't they? If so, it seems there is no difference between cross-frame attention and self-attention. I am quite new to this field, so could you help me with this?
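To make the question concrete, here is a minimal NumPy sketch of the difference as I understand it from the paper. All names here (`self_attention`, `cross_frame_attention`, `wq`, `wk`, `wv`) are my own for illustration, not identifiers from this repository, and the frames stand in for per-frame latent hidden states rather than prompt embeddings:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention over per-frame token sequences.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def self_attention(frames, wq, wk, wv):
    # Plain self-attention: each frame's Q, K, and V all come
    # from that same frame's hidden states.
    return attention(frames @ wq, frames @ wk, frames @ wv)

def cross_frame_attention(frames, wq, wk, wv):
    # Cross-frame attention as described in the paper: Q still
    # iterates over every frame, but K and V are fixed to the
    # first frame of the chunk, broadcast to all frames.
    first = np.broadcast_to(frames[0], frames.shape)
    return attention(frames @ wq, first @ wk, first @ wv)

# frames: (num_frames, tokens, dim) latent hidden states.
rng = np.random.default_rng(0)
frames = rng.standard_normal((3, 4, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))

sa = self_attention(frames, wq, wk, wv)
cfa = cross_frame_attention(frames, wq, wk, wv)
```

Under this reading, the two only coincide when every frame's hidden states are identical (e.g. for the first frame itself). The copied prompt embedding feeds the *cross-attention* (text-conditioning) layers, which is a different module from the self-attention layers where the cross-frame trick applies, and that distinction is exactly what I am unsure about.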