TQTQliu / MVSGaussian

[ECCV 2024] MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
https://mvsgaussian.github.io/
MIT License

Issues on two-stage cascaded framework #4

Closed Yef-Huang closed 2 months ago

Yef-Huang commented 4 months ago

Great work! Is the number of sampling points in your NeRF module the same as the number of points in 3DGS, or are they the same points? Is the number of sampling points in the final level 2? Is the first level used only for depth estimation, without introducing 3DGS? How do you handle the density of the Gaussian points—is it predicted through an MLP or mapped using a PDF?

chenll12345 commented 4 months ago

I have the same questions too.

TQTQliu commented 4 months ago

@Yef-Huang @chenll12345 Thanks for your interest.

  1. Yes, they are the same points.
  2. The number of sampling points in the final level is 1, i.e., each pixel corresponds to a single 3D point. We also discuss the case where the number of sampling points is 2 in the appendix.
  3. During training, the first level still establishes the pixel-aligned Gaussian representation and renders low-resolution views, which are supervised by downsampled low-resolution ground-truth views to boost depth estimation. During testing, the first level is used only for depth estimation and does not produce low-resolution views.
  4. The density of Gaussian points is predicted from features through MLP.
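Point 4 above can be sketched roughly as follows. This is a hypothetical minimal example (not the authors' actual network): a tiny two-layer MLP that maps per-point features to opacities in (0, 1) via a sigmoid head, illustrating the idea of regressing density/opacity directly from features rather than mapping it through a PDF. All weight shapes and names are illustrative assumptions.

```python
import numpy as np

def mlp_opacity(features, w1, b1, w2, b2):
    """Map per-point features to Gaussian opacities in (0, 1).

    Hypothetical two-layer MLP: ReLU hidden layer, then a sigmoid
    head, sketching "density predicted from features through MLP".
    """
    h = np.maximum(features @ w1 + b1, 0.0)         # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))     # sigmoid -> (0, 1)

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))                 # 5 points, 8-dim features
w1 = rng.standard_normal((8, 16)); b1 = np.zeros(16)
w2 = rng.standard_normal((16, 1)); b2 = np.zeros(1)
alpha = mlp_opacity(feats, w1, b1, w2, b2)          # shape (5, 1), each in (0, 1)
```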

chenll12345 commented 4 months ago

@TQTQliu Thank you for your very detailed answer, I understand everything! Do you think it is necessary to introduce 3DGS in the first level, or is the goal to introduce an additional loss? Since the purpose of the first level is depth estimation, does introducing 3DGS actually benefit depth estimation? If not, wouldn't it be better to remove the 3DGS network from the first level?

TQTQliu commented 4 months ago

@chenll12345 Yes, introducing 3DGS in the first level to render low-resolution views is necessary, and the loss between the low-resolution rendered views and the ground truth is beneficial for depth estimation. In an experiment where we removed the GS rendering part of the first stage, the final view quality metrics and depth accuracy both decreased (we did not include this ablation in the paper). In addition, since GS rendering in the first level is introduced only during the training phase and is not used at test/inference time, it does not affect the inference time.
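The training-only coarse supervision described above can be sketched as follows. This is a hypothetical illustration (function names, the L1 choice of loss, and average-pool downsampling are all assumptions, not the repository's actual code): during training, the coarse level's low-resolution rendering is compared against a downsampled ground-truth view; at inference the whole branch is skipped, so it adds no cost.

```python
import numpy as np

def downsample(img, factor):
    """Average-pool an HxW image by an integer factor (illustrative)."""
    h, w = img.shape
    return img[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def coarse_level_loss(rendered_lowres, gt_fullres, factor, training=True):
    """L1 loss between the coarse level's low-res rendering and the
    downsampled ground truth; skipped entirely at inference, where the
    coarse level only supplies depth."""
    if not training:
        return None
    gt_lowres = downsample(gt_fullres, factor)
    return np.abs(rendered_lowres - gt_lowres).mean()

# Training: a 2x2 rendering of 0.5 vs. a 4x4 ground truth of 1.0 -> L1 loss 0.5
loss = coarse_level_loss(np.full((2, 2), 0.5), np.ones((4, 4)), factor=2)
# Inference: the extra rendering/loss branch is bypassed
skipped = coarse_level_loss(np.full((2, 2), 0.5), np.ones((4, 4)), factor=2,
                            training=False)
```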

chenll12345 commented 4 months ago

Thank you for your response. I sincerely hope that this excellent work will be open-sourced.

TQTQliu commented 2 months ago

Thanks for your attention; the code has been released.