Closed KomonoLi closed 7 months ago
Yes. You can check the code for details (L43 to L174 of pipeline.py).
Nice work! But I have a question regarding the reference attention.
As mentioned in your paper, zero123 concatenates the condition image to the noisy input along the feature dimension for local conditioning, which imposes an incorrect pixel-wise alignment between the input and the condition image. But the noisy input is also guided by the condition image via cross-attention.
I am confused about how you implement your reference attention. Do you apply self-attention to the input and the condition image independently and then concatenate their K and V matrices? Would you mind giving some advice?
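For anyone reading along, here is a minimal NumPy sketch of the idea described in the question: queries come from the noisy input only, while the keys and values are the input's own K/V concatenated with the condition image's K/V along the token axis. This is not the code from pipeline.py. The function name `reference_attention`, the single-head layout, the shared projection weights, and the toy shapes are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention(x, ref, w_q, w_k, w_v):
    """Hypothetical single-head sketch, not the repository's implementation.

    x:   (n, d) tokens of the noisy input
    ref: (m, d) tokens of the condition (reference) image
    w_q, w_k, w_v: (d, d) projections, assumed shared by both streams
    """
    q = x @ w_q                                        # queries from the input only
    k = np.concatenate([x @ w_k, ref @ w_k], axis=0)   # (n + m, d) keys
    v = np.concatenate([x @ w_v, ref @ w_v], axis=0)   # (n + m, d) values
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n, n + m)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, m, d = 4, 6, 8                                      # toy sizes
x = rng.standard_normal((n, d))
ref = rng.standard_normal((m, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

out, weights = reference_attention(x, ref, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Because the concatenation is along the token axis rather than the feature axis, each input token can attend to any reference token, so no pixel-wise alignment between the two images is implied.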