donydchen / mvsplat

🌊 [ECCV'24 Oral] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
https://donydchen.github.io/mvsplat
MIT License

Epipolar cost volume and depth prediction #42

Open chenjiajie9811 opened 3 months ago

chenjiajie9811 commented 3 months ago

Hi there,

thank you for your great work! I have a question regarding the concept of using epipolar cost volume for the depth prediction.

It is reasonable to use this method in regions where the two images overlap: we can search along the epipolar line and take the position with the highest similarity as the depth prediction. But for pixels in regions without overlap, there are no true correspondences on the epipolar line, so the constructed cost volume may have equally low similarities everywhere. How is it possible for the network to learn the depth in this case?
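To make the concern concrete, here is a toy numpy sketch (all numbers invented) of softmax-based depth regression over a single pixel's cost slice: when one candidate matches strongly, the expected depth snaps to it; when all similarities are equally low, the softmax is uniform and the prediction collapses toward the mean of the candidates:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 32 hypothetical depth candidates between toy near/far planes.
depth_candidates = np.linspace(1.0, 10.0, 32)

# Overlapping pixel: one candidate along the epipolar line matches strongly.
scores_overlap = np.full(32, -2.0)
scores_overlap[5] = 6.0

# Non-overlap pixel: uniformly low similarity, no true correspondence.
scores_no_overlap = np.full(32, -2.0)

# Expected depth = softmax-weighted average over the candidates.
depth_overlap = (softmax(scores_overlap) * depth_candidates).sum()
depth_no_overlap = (softmax(scores_no_overlap) * depth_candidates).sum()
```

Here `depth_overlap` lands near the matching candidate, while `depth_no_overlap` is simply the mean of all candidates, which illustrates why the raw cost volume alone is uninformative in non-overlap regions.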

I am a little bit confused and looking forward to your reply.

Regards

fangchuan commented 3 months ago

Same concerns. Furthermore, I am also confused about how the approach in the paper produces real-scale depth predictions. Looking through the code, it does not seem significantly different from other learning-based stereo approaches. But when I export the Gaussians as a ply, the point cloud is almost at the same scale as the real world, which means the predicted depth is correct in scale. I am wondering how the cost-volume depth encoder network can figure out the scale without considering the stereo baseline?
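One way to think about where the scale could come from: the depth candidates of a plane-sweep cost volume are expressed in the same units as the camera pose translations used for warping, so if the poses are at (roughly) real scale, the best-matching candidate depth inherits that scale. A small numpy sketch with toy intrinsics and poses (not the actual MVSplat code) shows that scaling the scene and the baseline together leaves the projections, and hence the matching, unchanged:

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D point X into a view with intrinsics K and pose [R|t]."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
R = np.eye(3)
t = np.array([0.5, 0., 0.])       # baseline, in whatever units the poses use

X = np.array([0.2, 0.1, 4.0])     # a 3D point at depth 4 (same units)
x2 = project(K, R, t, X)

# Scale the whole scene (geometry AND baseline) by s: projections are
# identical, so image matching alone cannot distinguish the two scales.
s = 10.0
x2_scaled = project(K, R, s * t, s * X)
```

So the scale ambiguity is resolved not by the network but by the pose translations fed into the warping; matching only pins depth down relative to that baseline.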

donydchen commented 3 months ago

Hi, @chenjiajie9811, thanks for your interest in our work.

Our project indeed assumes that there are significant overlaps between the input views, and this is actually how the data is structured during testing (see the index-generating code here).

For parts that have no overlap, MVSplat relies on the subsequent UNet to help propagate the matching information (see Cost volume refinement in Sec. 3.1 of the paper). However, this solution is only an intuitive one and may merely ease the issue somewhat. Improving accuracy in those non-overlap regions is a promising direction for future work.
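As a loose intuition for why spatial refinement can help, the sketch below uses a plain box filter as a stand-in for the learned 2D UNet (which is, of course, far more expressive): smoothing lets matching confidence from overlapping pixels leak into the region where the raw cost volume carries no signal.

```python
import numpy as np

# Toy 1-D row of per-pixel matching confidences: pixels 0-5 lie in the
# overlap region and got strong matches; pixels 6-9 have no overlap, so
# the raw cost volume carries no signal there.
raw_scores = np.array([3., 3., 3., 3., 3., 3., 0., 0., 0., 0.])

# Stand-in for the refinement UNet: a simple moving average that
# propagates neighbouring pixels' matching information into the
# signal-free region.
kernel = np.ones(3) / 3.0
refined = np.convolve(raw_scores, kernel, mode="same")
```

After smoothing, the first non-overlap pixel picks up non-zero evidence from its overlapping neighbours, whereas in the raw scores it had none.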


Hi, @fangchuan. Your findings are pretty interesting. May I know how you confirmed that "the point cloud scale is almost the same as the real world one"? As far as I remember, there is no ground-truth 3D data for the RE10K or ACID datasets. In fact, MVSplat is not intended to predict real-scale depth (aka metric depth), which would be quite difficult to achieve without additional regularization. Instead, MVSplat merely aims to predict a relative depth bounded by the predefined near and far planes.
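For reference, a common way to set up such bounded depth candidates is to sample uniformly in inverse depth (disparity) between the near and far planes, so every prediction is guaranteed to fall inside those bounds. A small sketch with made-up bounds (the actual values come from the dataset configuration):

```python
import numpy as np

near, far = 1.0, 100.0   # made-up near/far planes for illustration
num_candidates = 128

# Uniform sampling in inverse depth, a common convention in plane-sweep
# stereo: candidates are denser near the camera, sparser far away.
disparity = np.linspace(1.0 / near, 1.0 / far, num_candidates)
depth_candidates = 1.0 / disparity
```

Any softmax-weighted combination of these candidates is necessarily confined to [near, far], which is what makes the predicted depth relative to the chosen planes rather than metric.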