IVRL / VolRecon

Official code of VolRecon (CVPR 2023)
MIT License

Questions about source views #2

Closed: GuangyuWang99 closed this issue 1 year ago

GuangyuWang99 commented 1 year ago

Hi, thanks for the excellent work! Here I have a small question regarding the term 'source views' used in this project.

According to the code, the image to be rendered (denoted as 'gt-color' for brevity) is also included in the 'source views', which means that the model aggregates 'gt-color' itself along with some neighboring views to render 'gt-color'. If this is the case, I wonder how the efficacy of the rgb render loss can be guaranteed, since the view transformer could simply pick the gt-color and give zero weight to all other views. However, as suggested in Tab. 4 of the original paper, the pure rgb render loss still delivers a reasonable result with a chamfer distance of 2.04.

Could you please clarify more on this point? Thanks in advance!

FangjinhuaWang commented 1 year ago

The training is similar to IBRNet and SparseNeuS: only neighboring views (i.e. 'source views') are used to render color and depth at a query viewpoint (i.e. the 'reference view').

GuangyuWang99 commented 1 year ago

Sorry for the misunderstanding! After carefully checking the code again, I find that during training the definition of 'reference view' and 'source views' follows the convention of MVSNet (i.e. 'pair.txt'). When performing geometry reconstruction, the reference view and source views are jointly taken as inputs (lines 249 & 250 in 'dtu_test_sparse.py'), which is the same as in MVS methods. However, the reference view should be removed if we want to test novel-view synthesis performance, in a way similar to IBRNet (see the sketch below).
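A minimal sketch of what I mean, assuming a pair.txt-style split into a reference id and its source ids (the function and variable names here are hypothetical, not the repository's actual code):

```python
def select_input_views(pair_entry, for_view_synthesis):
    """pair_entry: dict with 'ref_id' and 'src_ids' parsed from a pair.txt-style file."""
    if for_view_synthesis:
        # Only neighboring source views are fed to the network; the reference
        # view is kept solely as ground truth for evaluation, as in IBRNet.
        return pair_entry["src_ids"]
    # For geometry reconstruction, reference and source views are used together,
    # as in standard MVS pipelines.
    return [pair_entry["ref_id"]] + pair_entry["src_ids"]
```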

FangjinhuaWang commented 1 year ago

During sparse reconstruction, we render depth and rgb at a 'virtual' viewpoint that is near a given view. During full view reconstruction, since we evaluate the depth map metric and view synthesis (in the supplementary), we render depth and rgb at the reference view and do not use the ground-truth rgb at the reference view (i.e. source views only). These are all discussed in the paper.

SeaBird-Go commented 1 year ago

@FangjinhuaWang, sorry to ask questions in this closed issue again. I am also confused about the views. What are the exact meanings of sparse reconstruction and full view reconstruction?

From the paper, I understand that sparse reconstruction means we only use very few views (3 in the paper) to reconstruct the mesh. But the mesh may not be a complete 360-degree mesh, right?

And from Sec. 4.2, I guess full view reconstruction means you use all 49 depth maps and fuse them into a point cloud to compute the metrics. But for novel view synthesis (in the supplementary), you mentioned that you only use 4 input views for rendering, with the same dataset settings as full view reconstruction. So what is the meaning of the 4 input views?

FangjinhuaWang commented 1 year ago

Let's say the 49 viewpoints are I0, I1, ..., I48. In full view reconstruction, we render rgb and depth at each viewpoint. When rendering each viewpoint, we choose 4 source views as input. For example, when rendering I0, we may use the four known images I1, I2, I3 and I4. When rendering I10, we may use another set of known images, e.g. I8, I9, I11, I12. In the experiments, we use the four views with the highest view selection scores.
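To make this concrete, here is a rough sketch of picking the four highest-scoring source views from an MVSNet-style pair.txt (this assumes the standard MVSNet pair.txt layout and is only an illustration, not VolRecon's actual loader):

```python
def top_k_source_views(pair_path, k=4):
    """Return {reference_id: [k source ids with the highest view selection scores]}."""
    with open(pair_path) as f:
        lines = f.read().splitlines()
    num_views = int(lines[0])
    selection = {}
    for i in range(num_views):
        ref_id = int(lines[1 + 2 * i])
        items = lines[2 + 2 * i].split()  # "N id0 score0 id1 score1 ..."
        pairs = [(int(items[j]), float(items[j + 1])) for j in range(1, len(items), 2)]
        pairs.sort(key=lambda p: p[1], reverse=True)
        selection[ref_id] = [src_id for src_id, _ in pairs[:k]]
    return selection
```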

SeaBird-Go commented 1 year ago

Thanks for your quick reply! It helps me understand full view reconstruction very well.

BTW, during sparse view reconstruction, you only use 3 input views and fit the SRDF to infer the rendered rgb and depth maps. So I want to ask why you need to define a virtual rendering viewpoint by shifting the original camera coordinate frame by d = 25 mm along its x-axis. Is it just to validate whether the learned model can adapt to different viewpoints?

FangjinhuaWang commented 1 year ago

If we render at a given viewpoint I_0, then the projected 2D features in this viewpoint will always be the same for all samples along a ray. Since the pipeline is similar to novel view synthesis, we need to render at 'novel viewpoints'. The offset d = 25 mm is chosen somewhat arbitrarily and is reasonable for forming a stereo rig. If d is too large, there will be heavy occlusion. You can adjust this value, render more virtual views, and then fuse all of the depth maps.
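For reference, a minimal sketch of how such a virtual viewpoint could be constructed by shifting a camera-to-world pose along the camera's own x-axis (my own illustration, assuming poses expressed in millimeters as in DTU; not necessarily the exact code in the repository):

```python
import numpy as np

def shift_camera_along_x(cam_to_world: np.ndarray, d: float = 25.0) -> np.ndarray:
    """Return a virtual camera-to-world pose translated by d along the camera x-axis."""
    virtual = cam_to_world.copy()
    x_axis = cam_to_world[:3, 0]   # camera x-axis expressed in world coordinates
    virtual[:3, 3] += d * x_axis   # move the camera center along that axis
    return virtual
```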