SUDO-AI-3D / zero123plus

Code repository for Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model.
Apache License 2.0
1.56k stars, 108 forks

Reference attention #45

Closed KomonoLi closed 7 months ago

KomonoLi commented 7 months ago

Nice work! I have a question regarding the reference attention.

As mentioned in your paper, Zero123 concatenates the condition image with the noisy input along the feature dimension for local conditioning, which imposes an incorrect pixel-wise alignment between the input and the condition image. However, the noisy input is also guided by the condition image via cross-attention.

I am confused about how you implement reference attention. Do you apply self-attention to the input and the condition image independently and then concatenate their K and V matrices? Could you give some advice?

eliphatfs commented 7 months ago

> Do you apply self-attention to the input and the condition image independently and then concatenate their K and V matrices? Could you give some advice?

Yes. You can check the code for details. (L43 to L174 of pipeline.py)
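
For readers who want the idea before opening pipeline.py, here is a minimal sketch of the mechanism confirmed above. It is not the repository's code. It assumes a single attention head, uses NumPy instead of the project's framework, and concatenates along the token dimension. The class `ReferenceSelfAttention`, its `write`/`read` methods, and all shapes are hypothetical names chosen for illustration. The condition image gets its own self-attention pass and its tokens are cached. The noisy input then attends over its own keys and values plus those of the cached reference.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # q: (n_q, d); k, v: (n_kv, d). Returns (n_q, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


class ReferenceSelfAttention:
    """Hypothetical single-head self-attention layer with a reference cache."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.w_q = rng.standard_normal((dim, dim)) * scale
        self.w_k = rng.standard_normal((dim, dim)) * scale
        self.w_v = rng.standard_normal((dim, dim)) * scale
        self.ref_tokens = None

    def write(self, ref_tokens):
        # Pass 1 (condition image): ordinary self-attention.
        # The tokens are cached so the second pass can reuse them.
        self.ref_tokens = ref_tokens
        return attention(ref_tokens @ self.w_q,
                         ref_tokens @ self.w_k,
                         ref_tokens @ self.w_v)

    def read(self, tokens):
        # Pass 2 (noisy input): queries come from the input only;
        # keys and values come from the input AND the cached reference.
        # Projecting the concatenated tokens equals concatenating K and V.
        context = np.concatenate([tokens, self.ref_tokens], axis=0)
        return attention(tokens @ self.w_q,
                         context @ self.w_k,
                         context @ self.w_v)


rng = np.random.default_rng(0)
layer = ReferenceSelfAttention(dim=8, rng=rng)
cond_tokens = rng.standard_normal((16, 8))   # condition-image tokens (assumed shape)
noisy_tokens = rng.standard_normal((96, 8))  # noisy multi-view tokens (assumed shape)

layer.write(cond_tokens)          # condition image attends to itself
out = layer.read(noisy_tokens)    # input attends to itself and the reference
print(out.shape)                  # (96, 8)
```

Because the reference only enters through extra keys and values, the output keeps the shape of the noisy input and no pixel-wise alignment between the two images is assumed.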