hsilvaga opened this issue 2 weeks ago
Hello,

I'm wondering how the results for 3 input views were obtained. The network seems to be structured to accept only 2 input views. Was some sort of global alignment used, as in MASt3R?

Hello, thank you for your interest in our work. As explained in Sec. B of the Appendix (Details on Extension to 3 Input Views), when generating the Gaussians for each view in the 3-view input case, we concatenate the feature tokens of all the other views and perform cross-attention with these concatenated tokens at the ViT decoder stage. Thus, it is still feed-forward and does not require global alignment. We will release the 3-view model soon.
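To make the mechanism concrete, here is a minimal PyTorch sketch (not the released code) of per-view cross-attention over the concatenated tokens of the other views; the module name, embedding dimension, head count, and token shapes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ViewwiseCrossAttention(nn.Module):
    """Sketch: each view's tokens cross-attend to the concatenated tokens
    of all other views, so the multi-view extension stays feed-forward
    (no global alignment step)."""
    def __init__(self, embed_dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, view_tokens: list[torch.Tensor]) -> list[torch.Tensor]:
        # view_tokens: one (B, N, C) token tensor per input view.
        outputs = []
        for i, query in enumerate(view_tokens):
            # Concatenate the tokens of every *other* view along the token axis
            # and use them as keys/values for cross-attention.
            context = torch.cat([t for j, t in enumerate(view_tokens) if j != i], dim=1)
            attended, _ = self.attn(query, context, context)
            outputs.append(attended)
        return outputs

# Usage with three views of ViT features, e.g. (batch=1, tokens=196, dim=768) each.
tokens = [torch.randn(1, 196, 768) for _ in range(3)]
out = ViewwiseCrossAttention()(tokens)  # list of 3 tensors, same shapes as the inputs
```

In this reading, the 2-view architecture generalizes simply because the cross-attention context grows from one other view's tokens to the concatenation of all other views' tokens, with everything else unchanged.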