cvg / NoPoSplat

No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images
MIT License

Question: How was it extended to 3 views? #5

Open hsilvaga opened 2 weeks ago

hsilvaga commented 2 weeks ago

Hello,

I'm wondering how the results for 3 input views were obtained? It seems that the network is structured to only accept 2 input views. Was there some sort of global alignment used like in Mast3r?

botaoye commented 2 weeks ago

Hello, thank you for your interest in our work. As explained in Sec. B of the Appendix (Details on Extension to 3 Input Views), when generating the Gaussians for each view in the 3-view input case, we concatenate the feature tokens of all the other views and perform cross-attention with these concatenated tokens at the ViT decoder stage. Thus, it is still feed-forward and does not require global alignment. We will release the 3-view model soon.
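
For readers following along, here is a minimal NumPy sketch of the idea described above: each view's tokens act as queries, and the keys and values come from the concatenated tokens of all the other views. This is not the NoPoSplat code. The function name `cross_attend_to_other_views`, the single attention head, and the projection matrices `w_q`, `w_k`, `w_v` are assumptions made for illustration, and residual connections, normalization, and MLP layers are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend_to_other_views(tokens, w_q, w_k, w_v):
    """Single-head cross-attention from each view to all other views.

    tokens: array of shape (V, N, D), i.e. V views with N tokens of width D.
    w_q, w_k, w_v: hypothetical (D, D) projection matrices.
    Returns an array of shape (V, N, D).
    """
    num_views, _, dim = tokens.shape
    outputs = []
    for i in range(num_views):
        # Concatenate the tokens of every view except view i: ((V-1)*N, D).
        others = np.concatenate(
            [tokens[j] for j in range(num_views) if j != i], axis=0
        )
        q = tokens[i] @ w_q   # queries come from the current view
        k = others @ w_k      # keys come from the other views
        v = others @ w_v      # values come from the other views
        attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)  # (N, (V-1)*N)
        outputs.append(attn @ v)
    return np.stack(outputs)


# Usage: 3 views, 4 tokens per view, 8 channels.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = cross_attend_to_other_views(tokens, w_q, w_k, w_v)
```

With two views, the concatenation contains only the single other view, so the sketch reduces to ordinary pairwise cross-attention. Adding a third view only lengthens the key/value sequence, which is why the forward pass stays feed-forward with no separate alignment step.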