Closed yunfan1202 closed 1 year ago
Hi Yunfan,
For inference, I think the only limit is your GPU memory. For training, we use 3-20 images as input frames. We also tried 3-50 and it still works quite well (even better).
The model takes BxN images, processes them into BxNxC features, passes them through a transformer (attention modules), and then predicts camera poses via several MLPs. Therefore, when you feed in a different number of input frames, only N changes, and the transformer architecture can naturally deal with this. We don't need any additional operations here.
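To illustrate why attention is agnostic to the number of frames, here is a minimal numpy sketch of single-head self-attention over BxNxC frame features (no learned Q/K/V projections, unlike the real model): the same function works unchanged for any N, and the output keeps the (B, N, C) shape.

```python
import numpy as np

def self_attention(x):
    # x: (B, N, C) per-frame features (e.g. DINO descriptors).
    # Simplified: queries, keys, and values are all x itself;
    # the real model applies learned projections first.
    B, N, C = x.shape
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(C)  # (B, N, N) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)    # softmax numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)        # rows sum to 1
    return attn @ x                                  # (B, N, C), same shape as input

# The identical module handles 3, 20, or 50 input frames:
for n in (3, 20, 50):
    feats = np.random.randn(2, n, 64)  # B=2, N=n, C=64
    out = self_attention(feats)
    assert out.shape == (2, n, 64)
```

The key point is that no weight in attention has a dimension tied to N; only the (B, N, N) score matrix grows with the number of frames, which is why GPU memory is the practical limit.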
Got it! Thanks for your detailed reply!
Hi, excellent work! The paper claims "The method can predict intrinsics and extrinsics for an arbitrary amount of images". I'm wondering what the upper bound (maybe a fixed value?) of that arbitrary number of input images is when training the pose diffusion model.
Also, suppose the pose diffusion model is pre-trained with, for example, 50 input images per scene. How are the image features extracted by DINO then processed or concatenated when fewer input images are given, like 20? I tried to find the answer and related implementation details in the paper but failed. Did I miss something?
Thank you so much!