Hi there, thank you for your great work!
I am using the depth ControlNet version and have a few questions about cross-frame attention. According to the paper, cross-frame attention fixes K and V to the first frame of the chunk and iterates Q. But according to the code, K and V are generated from the encoder hidden states, which come from the prompt. Since there is only one prompt at the beginning, which is then copied several times, the different slices of the encoder hidden states are all identical, aren't they? If so, it seems there is no difference between cross-frame attention and self-attention. I am quite new to this field, so could you help me with this?
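To make the question concrete, here is a minimal NumPy sketch of the difference as I understand it from the paper. All names here (`self_attention`, `cross_frame_attention`, `wq`, `wk`, `wv`) are my own for illustration, not identifiers from this repository, and the frames stand in for per-frame latent hidden states rather than prompt embeddings:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention over per-frame token sequences.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def self_attention(frames, wq, wk, wv):
    # Plain self-attention: each frame's Q, K, and V all come
    # from that same frame's hidden states.
    return attention(frames @ wq, frames @ wk, frames @ wv)

def cross_frame_attention(frames, wq, wk, wv):
    # Cross-frame attention as described in the paper: Q still
    # iterates over every frame, but K and V are fixed to the
    # first frame of the chunk, broadcast to all frames.
    first = np.broadcast_to(frames[0], frames.shape)
    return attention(frames @ wq, first @ wk, first @ wv)

# frames: (num_frames, tokens, dim) latent hidden states.
rng = np.random.default_rng(0)
frames = rng.standard_normal((3, 4, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))

sa = self_attention(frames, wq, wk, wv)
cfa = cross_frame_attention(frames, wq, wk, wv)
```

Under this reading, the two only coincide when every frame's hidden states are identical (e.g. for the first frame itself). The copied prompt embedding feeds the *cross-attention* (text-conditioning) layers, which is a different module from the self-attention layers where the cross-frame trick applies, and that distinction is exactly what I am unsure about.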