SUDO-AI-3D / zero123plus

Code repository for Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model.
Apache License 2.0
1.56k stars 108 forks source link

Reference Only Local Conditioning Clarification #53

Closed fradif96 closed 6 months ago

fradif96 commented 7 months ago

Great work! I have a question regarding the reference-only attention implementation. In the paper it is written that 'Reference Attention refers to the operation of running the denoising UNet model on an extra reference image and appending the self-attention key and value matrices from the reference image to the corresponding attention layers when denoising the model input'. I would kindly ask if the "append" operation should be intended as a concatenation between the tensors. I mean: if for example both conditioning and input latents are [1, 4, 32] and the Q,K,V project both in e.g. [1, 4, 5], then the concatenation along first dimension should give us a result of [1,4,5] for Query and [1,8,5] for Key and Value. Self-attention matrix should finally result in [1,4,8].

Is this the intended computation? Thanks in advance

eliphatfs commented 7 months ago

Yes.