Hi, @Tangshitao!
I want to express my appreciation once again for sharing this exceptional work! I would like to understand the paper fully and have a couple more questions.
What do Q, K, and V represent in Equation 2?
As I understand it, the aim is to compute cross-attention between overlapping frames, so the query would come from the source features while the key and value would come from the target features. However, I noticed that you explicitly denote the features as $\bar{F}$, which leaves me unsure which features Q, K, and V are each derived from.
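To make my reading concrete, here is a minimal NumPy sketch of what I assume the attention computes. Everything in it is my own assumption rather than the paper's implementation: the function name `cross_attention`, the projection matrices `W_q`, `W_k`, `W_v`, the feature shapes, and above all the choice of taking queries from the source features and keys/values from the target features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(src_feat, tgt_feat, W_q, W_k, W_v):
    """Hypothetical cross-attention: queries from source, keys/values from target.

    src_feat: (N_src, C) source-frame features
    tgt_feat: (N_tgt, C) target-frame features
    W_q, W_k, W_v: (C, D) projection matrices (assumed, not from the paper)
    """
    Q = src_feat @ W_q                       # (N_src, D)
    K = tgt_feat @ W_k                       # (N_tgt, D)
    V = tgt_feat @ W_v                       # (N_tgt, D)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product, (N_src, N_tgt)
    attn = softmax(scores, axis=-1)          # each source location attends over target locations
    return attn @ V, attn

rng = np.random.default_rng(0)
C, D = 8, 4
src = rng.normal(size=(5, C))   # 5 source locations
tgt = rng.normal(size=(7, C))   # 7 target locations
W_q, W_k, W_v = (rng.normal(size=(C, D)) for _ in range(3))
out, attn = cross_attention(src, tgt, W_q, W_k, W_v)
```

If Q, K, and V are instead all derived from $\bar{F}$ in some other way, I would be grateful to know where this sketch departs from the paper.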
How were occlusions handled in the context of $K > 1$ in geometry-conditioned image generation?
In Figure 3, you presented the re-projection scheme used to obtain the source feature locations corresponding to their neighbors. However, in many cases, occlusions may occur, making it impossible to guarantee the presence of correspondences. While this may not be problematic in panorama generation where correspondences are assured, in the case of geometry-conditioned image generation with $K > 1$, handling occlusions becomes crucial. Have you explored scenarios where $K > 1$ in this context and discovered any solutions to address occlusion-related challenges?
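For reference, the kind of handling I have in mind is a depth-consistency check that masks out occluded correspondences before attention. This is purely my assumption and not something I found in the paper; the function `visibility_mask`, its arguments, and the tolerance `rel_tol` are hypothetical names used only for illustration.

```python
import numpy as np

def visibility_mask(reproj_depth, tgt_depth_sampled, rel_tol=0.05):
    """Hypothetical occlusion test for re-projected source pixels.

    reproj_depth: depth of each source pixel after transforming it into the target camera
    tgt_depth_sampled: target depth map sampled at the re-projected pixel locations
    rel_tol: relative depth tolerance (assumed value)
    """
    # A point behind the camera or without a finite target depth has no correspondence.
    valid = (reproj_depth > 0) & np.isfinite(tgt_depth_sampled)
    # A point is treated as visible only if both depths agree within the tolerance.
    consistent = np.abs(reproj_depth - tgt_depth_sampled) <= rel_tol * tgt_depth_sampled
    return valid & consistent

reproj = np.array([2.0, 3.0, 5.0, -1.0])
sampled = np.array([2.05, 1.5, 5.0, 1.0])
mask = visibility_mask(reproj, sampled)
# The second point lies behind a closer surface (occluded); the fourth is behind the camera.
```

Correspondences where the mask is false could then be excluded from the attention, but I do not know whether your method does anything like this.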
Thank you once again for your time and for considering my questions. I am eager to gain a deeper understanding of your work.
Best regards, Sang Min Kim