fraunhoferhhi / neural-deferred-shading

Multi-View Mesh Reconstruction with Neural Deferred Shading (CVPR 2022)

Some insights #15

Open boqian-li opened 7 months ago

boqian-li commented 7 months ago

Hi, thanks for your great work! I'm writing to ask for some insights. I can see that NDS reconstructs a near-perfect surface with lots of detail on it, and I wonder how that surface detail is captured. The mask loss can only capture the coarse shape, so I assume the details come from the shading loss. But why does it work so well without any geometric prior, using only RGB images? Could you please share some insight on that?

Aside from that, I also want to know whether it's necessary to capture the images under fixed lighting. If the light is co-located with the camera and moves with it, will the result be very different?

mworchel commented 1 month ago

Hi,

you are correct, the details are captured by the shading loss (as shown in Figure 12 in the supplementary material). It does not work perfectly but rather converges to one possible surface that could explain the observations (i.e., the images).
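For context, here is a minimal sketch of how such an image-space objective might be assembled. The function name, weights, and exact loss forms are illustrative assumptions, not the repository's actual API (the paper also includes geometric regularizers not shown here):

```python
import torch

def reconstruction_loss(rendered_rgb, target_rgb, rendered_mask, target_mask,
                        weight_shading=1.0, weight_mask=2.0):
    """Hypothetical combination of the two image-space terms discussed above.

    The mask term pulls the rendered silhouette towards the observed one
    (coarse shape), while the shading term compares rendered and captured
    colors and is what drives the recovery of fine surface detail.
    """
    # Silhouette agreement: squared error between rendered and observed masks
    mask_loss = torch.mean((rendered_mask - target_mask) ** 2)

    # Appearance agreement: only evaluated where the object is actually visible
    visible = target_mask > 0.5
    shading_loss = torch.mean(torch.abs(rendered_rgb[visible] - target_rgb[visible]))

    return weight_shading * shading_loss + weight_mask * mask_loss
```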

There are many local minima (i.e., combinations of vertex positions and shader parameters) that fit the observations, and I would argue two central points help the optimization converge to a reasonable one: (1) the visual hull initialization is already close to the actual surface; if this is not the case, e.g., for non-convex regions, the solution is not optimal (see the failure case in Figure 14 of the supplementary material). (2) the representational power of the shader is (artificially) limited by the architecture, so there is an incentive to represent details in geometry; if the shader is too expressive, there is a chance that details are "baked" into the appearance (see, e.g., the SIREN architecture in Figure 9).
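To illustrate point (2), here is a rough sketch of what a deliberately small shader could look like. This is not the architecture used in the repository; the layer count, widths, and activations are placeholders:

```python
import torch
import torch.nn as nn

class TinyShader(nn.Module):
    """Illustrative neural shader with deliberately limited capacity.

    It maps a surface point, its normal, and the viewing direction to an RGB
    color. Keeping the network small (few, narrow layers) limits how much
    high-frequency detail can be "baked" into appearance, nudging the
    optimization to explain detail with geometry instead.
    """
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, position, normal, view_dir):
        # All inputs are (..., 3); note there is no per-camera or per-light conditioning.
        return self.mlp(torch.cat([position, normal, view_dir], dim=-1))
```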

Analysis-by-synthesis techniques are susceptible to visual ambiguities (see Figure 6 in this paper), even more so in our case when the appearance is modeled by a black box neural network that does not adhere to the physics of light.

As for the question of (moving) light: since we are training a single shader that makes no distinction between the cameras and only considers viewing angles, the appearance should be consistent between views. Therefore, I wouldn't expect meaningful results for images with co-located camera and light.
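To make this concrete, continuing the hypothetical `TinyShader` sketch from above: the shader is shared across all views and its only view-dependent input is the viewing direction, so a light source that moves with the camera has no input it could be explained by.

```python
import torch

# Reusing the hypothetical TinyShader defined in the sketch above.
shader = TinyShader()

point = torch.rand(1, 3)                                          # a surface point
normal = torch.nn.functional.normalize(torch.rand(1, 3), dim=-1)  # its normal
view_dir = torch.nn.functional.normalize(torch.rand(1, 3), dim=-1)

# Two captures taken from the same direction but with the light in different
# positions would show different radiance, yet the shader can only produce a
# single color for this input; the lighting change has nowhere to go except
# into errors in geometry or appearance.
color_capture_1 = shader(point, normal, view_dir)
color_capture_2 = shader(point, normal, view_dir)
assert torch.allclose(color_capture_1, color_capture_2)  # identical by construction
```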