Great work! I have a question about the reference-only attention implementation. The paper states that 'Reference Attention refers to the operation of running the denoising UNet model on an extra reference image and appending the self-attention key and value matrices from the reference image to the corresponding attention layers when denoising the model input'.
I would like to ask whether the "append" operation is meant as a concatenation of the tensors. For example, if the conditioning (reference) and input latents are both [1, 4, 32] and the Q, K, V projections map each of them to e.g. [1, 4, 5], then concatenating along the token dimension (dim 1, the one of size 4) should give [1, 4, 5] for the Query and [1, 8, 5] for the Key and Value. The self-attention matrix should then have shape [1, 4, 8].
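To make the question concrete, here is a minimal NumPy sketch of how I understand the computation. It is only my reading of the sentence quoted above, not code from the paper or the repository: the weights are random placeholders, a single head is assumed, and all variable names (`x_input`, `x_ref`, `w_q`, `w_k`, `w_v`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes taken from the example in this post.
batch, tokens, dim, head_dim = 1, 4, 32, 5

# Hidden states of the sample being denoised and of the reference image.
x_input = rng.standard_normal((batch, tokens, dim))
x_ref = rng.standard_normal((batch, tokens, dim))

# Placeholder projection weights (assumed shared between both passes).
w_q = rng.standard_normal((dim, head_dim))
w_k = rng.standard_normal((dim, head_dim))
w_v = rng.standard_normal((dim, head_dim))

# The query comes from the input only: [1, 4, 5].
q = x_input @ w_q

# Keys and values: input and reference concatenated along the token axis: [1, 8, 5].
k = np.concatenate([x_input @ w_k, x_ref @ w_k], axis=1)
v = np.concatenate([x_input @ w_v, x_ref @ w_v], axis=1)

# Scaled dot-product scores: [1, 4, 8].
scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)

# Numerically stable softmax over the 8 key positions.
scores = scores - scores.max(axis=-1, keepdims=True)
attn = np.exp(scores)
attn = attn / attn.sum(axis=-1, keepdims=True)

# The output keeps the input's token count: [1, 4, 5].
out = attn @ v

print(q.shape, k.shape, attn.shape, out.shape)
```

In this reading each of the 4 input tokens attends to 8 positions (4 of its own and 4 from the reference), and the output shape matches ordinary self-attention.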
Is this the intended computation?
Thanks in advance