Closed SunghwanHong closed 3 months ago
Hello,
Thanks for highlighting your paper! I had overlooked it when I was reviewing the literature, but I will include it when I update our arXiv submission.
Regarding the scene scale, the foundation model that we use (MASt3R) predicts the scene geometry in metric scale, and the dataset we use for training (ScanNet++) has camera poses for images which are also in metric scale. We found that MASt3R does a surprisingly good job of estimating the scene scale, so we directly use the pose of the target image (transformed to the coordinate frame of the first context image) to generate our rendered target images. Occasionally this means that our renderings are misaligned (for example in the third column of Figure 4 in our paper), but for the most part this works well.
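For concreteness, here is a minimal sketch of that transformation (simplified, not our exact code; it assumes 4x4 camera-to-world extrinsics in the dataset's metric world frame):

```python
import numpy as np

def target_pose_in_context_frame(T_ctx0_c2w: np.ndarray, T_tgt_c2w: np.ndarray) -> np.ndarray:
    """Express the target camera pose in the coordinate frame of the first
    context camera. Both inputs are 4x4 camera-to-world matrices; the output
    is the target's camera-to-world pose in a world frame anchored at the
    first context camera."""
    return np.linalg.inv(T_ctx0_c2w) @ T_tgt_c2w

# Because MASt3R predicts the point maps in the frame of the first context
# view, and in approximately metric scale, this relative pose can be passed
# directly to the renderer as the target camera.
```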
In an earlier version of this project based on DUSt3R we didn't make predictions in metric units, but we were able to estimate the difference between the predicted and ground-truth scene scales using the normalization factors that were already being calculated for DUSt3R's loss. However, this still relies on having a dataset with depth for every pixel in the same scale as the camera pose parameters.
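To illustrate the idea, here is a hedged sketch (hypothetical helper names, not DUSt3R's actual implementation): the same average-distance normalization used for the regression loss gives a per-scene scale factor, and the ratio of the ground-truth and predicted factors approximately aligns the two scales:

```python
import torch

def avg_dist_norm(pts: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    """Average distance of valid 3D points from the origin (the same kind of
    normalization factor DUSt3R computes for its regression loss).
    pts: (H, W, 3) point map, valid: (H, W) boolean mask."""
    return pts[valid].norm(dim=-1).mean()

def estimate_scale_ratio(pred_pts: torch.Tensor, gt_pts: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    """Ratio between ground-truth and predicted scene scale. Multiplying the
    predicted geometry by this factor roughly aligns it with the scale of the
    camera poses before rendering the target view."""
    return avg_dist_norm(gt_pts, valid) / avg_dist_norm(pred_pts, valid)
```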
Thanks for the reply! I'm excited to see the next updated version.
Regarding the scale: since MASt3R's training data includes ScanNet++, it will indeed have approximately the right scale, yes. However, this means that Splatt3R should not claim to be a zero-shot model, because it is trained and evaluated on ScanNet++ (although I was impressed by its zero-shot results on in-the-wild images captured by phones), nor should it claim to use no supervision other than RGB, because it uses ground-truth depth. As written, the paper makes it hard for readers unfamiliar with the topic to understand whether RGB images alone are really all that is required in general. If the method is trained on a different dataset (one not used for MASt3R), it will very likely fail to receive sufficient gradients due to the scale difference; in other words, it cannot use posed target images for supervision. I think this changes the whole problem formulation.
Please let me know if there is anything I misunderstood!
Yes, I am keen to emphasise that we require ground truth point maps at training time, which makes our method not directly comparable to existing RGB-only methods. I hope I have made that clear in the paper.
Regarding your point about MASt3R being trained on a dataset that includes ScanNet++, that is fair. My presumption is that you would use the same datasets to train both the MASt3R model (as pretraining) and the Gaussian head. We use the pretrained MASt3R weights for convenience and generalisability. If you wanted to use a different dataset, you would perform your MASt3R pretraining with that dataset and then train the Gaussian head. I suspect MASt3R's scale prediction is fairly robust, but to your point I would be interested to see how stable training is if the Gaussian head is trained on a dataset that wasn't used for the MASt3R pretraining stage.
We have been thinking about ways to better update the Gaussian parameters when the rendered target image is misaligned with ground truth target, and we may explore this in future work.
I am doing the same thing, but my model uses DUSt3R. When I train it on the RE10K dataset, the renderings are still blurry; PSNR is around 22. I also found that the estimated extrinsic and intrinsic parameters can be aligned to the ground truth by a certain scale ratio or transformation matrix. The estimated error is relatively small, but I have not conducted experiments to see whether it can be ignored.
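A rough way to check this would be to fit a single scale factor between the estimated and ground-truth camera translations; here is a hypothetical sketch (not from either codebase):

```python
import numpy as np

def fit_translation_scale(t_est: np.ndarray, t_gt: np.ndarray) -> float:
    """Least-squares scale s minimizing ||s * t_est - t_gt||^2 over all camera
    translations. Each array is (N, 3), expressed in the same reference frame,
    e.g. relative to the first camera."""
    num = float((t_est * t_gt).sum())
    den = float((t_est * t_est).sum())
    return num / den if den > 0 else 1.0

# If the fitted scale is close to 1 and the residuals are small, the scale
# mismatch is probably negligible; otherwise the rendered targets will be
# misaligned and supervision from posed target images degrades.
```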
Thanks for the information. I believe MASt3R should show quite a difference from DUSt3R, since it was trained with metrically scaled depth points. My guess is that training on OOD datasets may be possible, but it will definitely not yield the best performance, since MASt3R is not perfect.
Hi,
Thanks for the paper and code implementations!
I wanted to ask a quick question regarding the training procedure. In the paper, you mention that posed target images are used. How did you align the GT camera pose scale with your prediction scale? Does the offset estimator account for the differences? In other words, how did you use GT-posed images as your novel-view target images?
By the way, I would highly appreciate it if you would consider mentioning CoPoNeRF (https://ku-cvlab.github.io/CoPoNeRF/), since it is the first model that performs 3D reconstruction and novel view synthesis from wide-baseline, unposed stereo pairs of images in a feed-forward manner.
Thanks!