facebookresearch / neural-light-fields

This repository contains the code for Learning Neural Light Fields with Ray-Space Embedding Networks.

Local affine embedding #6

Closed dlutzzw closed 1 year ago

dlutzzw commented 1 year ago

Hello, this is very interesting work.

But I have some questions that I am quite confused about. After parameterizing a ray, why does an affine transformation plus offset have such a large impact on the final result (especially on the Stanford data)? The paper mentions that this kind of embedding can guarantee multi-view consistency of rays, but the limitations section also mentions that it does not work as well when the input coordinates are changed to Plücker coordinates. So I find it hard to understand the role of the local affine embedding. Can you help me?

breuckelen commented 1 year ago

Hi, thanks for your questions!

The goal of re-parameterizing / embedding the input rays is so that the model can both memorize and interpolate the light field more effectively.

But how can an embedding of the input lead to multi-view consistent interpolation and better memorization? Let's consider an "ideal" embedding. An ideal embedding network will map all rays in the scene that observe the same 3D point to the same location in the embedded space. If the embedding network is able to do this for rays in our training images, as well as for rays that are not observed during training (interpolated or extrapolated rays), then multi-view consistency comes for free --- all rays that observe the same point will be mapped to the same location in embedded space, and our second network will map all of these rays to the same color. Without embedding, a naive MLP architecture does not know what to do for rays that it does not observe during training, and so it will not interpolate views in a multi-view consistent manner (see examples on our website: https://neural-light-fields.github.io/results.html). An ideal embedding also improves light field memorization, because rather than learning to associate each ray in the input 4D space with a color, our second network only has to learn to map locations in the embedding space (where each location corresponds to a 3D point) to colors.
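For concreteness, the two-stage structure looks roughly like this (a simplified sketch, not our exact code; the layer sizes, activations, and module names are illustrative, and the actual model also uses positional encoding and the affine embedding discussed below):

```python
import torch
import torch.nn as nn

class TwoStageLightField(nn.Module):
    """Sketch of the two-stage structure: embed a ray, then predict its color.

    Illustrative only -- sizes, activations, and encodings differ from the
    actual repository code.
    """
    def __init__(self, ray_dim=4, embed_dim=32, hidden=128):
        super().__init__()
        # Embedding network: maps a ray to a location in embedded space.
        self.embed = nn.Sequential(
            nn.Linear(ray_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )
        # Color network: maps the embedded ray to an RGB color.
        self.color = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, rays):          # rays: (N, 4) two-plane coordinates
        z = self.embed(rays)          # (N, embed_dim) embedded rays
        return self.color(z)          # (N, 3) predicted colors
```

If two rays that see the same 3D point land at (nearly) the same embedded location `z`, the color network has no choice but to assign them (nearly) the same color, which is exactly the multi-view consistency argument above.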

Okay, so on to your first question: why is a locally affine re-parameterization a good choice of embedding for a two-plane parameterized light field (like the Stanford Light Fields)? The reason is that in the two-plane parameterization, all rays that observe the same 3D point form an affine subspace. We only need to learn a single affine transform for all rays in this subspace in order to map them to the same location in embedded space. Further, if the z-depth of the scene is smoothly varying, then the shape of these affine subspaces will also vary smoothly --- and thus we can learn a smoothly varying set of affine transforms across all rays. Essentially, the choice to let our embedding network predict affine transforms means that our "ideal" embedding signal is smooth, and thus easy to learn / interpolate effectively with a simple MLP architecture.
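If it helps, the affine-subspace claim is easy to verify numerically. This toy script (my own notation and depth values, unrelated to the repository code) samples many rays through a single 3D point, writes them in two-plane coordinates, and checks that they span only a 2-dimensional affine subspace of the 4D ray space:

```python
import numpy as np

# Two parameterization planes at depths z_xy and z_uv (arbitrary choices).
z_xy, z_uv = 0.0, 1.0
p = np.array([0.3, -0.2, 2.5])         # a single 3D scene point

# Sample rays through p by sampling slopes (a, b) = (dx/dz, dy/dz).
slopes = np.random.uniform(-0.5, 0.5, size=(1000, 2))
x = p[0] + (z_xy - p[2]) * slopes[:, 0]
y = p[1] + (z_xy - p[2]) * slopes[:, 1]
u = p[0] + (z_uv - p[2]) * slopes[:, 0]
v = p[1] + (z_uv - p[2]) * slopes[:, 1]
rays = np.stack([x, y, u, v], axis=1)  # (1000, 4) two-plane coordinates

# A 2D affine subspace => the mean-centered data has rank 2.
centered = rays - rays.mean(axis=0)
sv = np.linalg.svd(centered, compute_uv=False)
print(sv)   # only the first two singular values are (numerically) non-zero
```

Because the directions spanning this subspace depend only on the point's depth, the affine map that collapses it varies smoothly over a scene with smoothly varying depth, which is what makes it learnable by a small MLP.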

As to your second question, this does not work as well for Plücker parameterizations, because rays observing the same 3D point no longer comprise an affine subspace. Other smooth embeddings, which either take advantage of the structure of Plücker coordinates or are agnostic to the underlying parameterization of the light field, are required for better performance here.
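The same numerical check makes the Plücker case concrete (again just an illustrative script in my own notation): rays through one point, written in normalized Plücker coordinates (d, p × d), have a 3-dimensional affine hull, so they cannot form a 2D affine subspace, and no single affine map collapses them to one embedded location:

```python
import numpy as np

p = np.array([0.3, -0.2, 2.5])                 # a single 3D scene point

# Sample unit directions pointing towards a consistent hemisphere.
d = np.random.normal(size=(1000, 3))
d[:, 2] = -np.abs(d[:, 2])
d /= np.linalg.norm(d, axis=1, keepdims=True)

# Pluecker coordinates of the rays through p: (direction, moment = p x d).
pluecker = np.concatenate([d, np.cross(p, d)], axis=1)   # (1000, 6)

centered = pluecker - pluecker.mean(axis=0)
sv = np.linalg.svd(centered, compute_uv=False)
print(sv)   # three significant singular values: not a 2D affine subspace
```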

Note that we try to address many of these questions in section 4, and in the supplement of our paper: https://arxiv.org/pdf/2112.01523.pdf.

I hope this helps clear up some of your confusion. Please feel free to follow up if you have any additional questions.

dlutzzw commented 1 year ago

Thanks for your reply, I still have a few questions to discuss with you.

First question: you wrote, "Let's consider an 'ideal' embedding. An ideal embedding network will map all rays in the scene that observe the same 3D point to the same location in the embedded space." Do I understand this correctly: for the light field, a 3D point emits radiance in all directions, and the role of the embedding network is to map each ray collected from the 2D images back to its corresponding 3D point. The MLP then predicts the color, so that rays emitted from the same 3D point get similar colors, which achieves multi-view consistency. Right?

Second question (assuming my understanding above is correct): the discussion above concerns an "ideal" embedding network, but for the rays extracted from the 2D images (where the radiance of rays emitted from the same 3D point may be view-dependent), how does the embedding network find the correct 3D point for these rays? In other words, how do you ensure that rays from the same 3D point are mapped to the ideal 3D point feature (the same location in embedding space) after passing through the embedding network?

breuckelen commented 1 year ago

Both good questions! Regarding your first question, that's correct. For your second question, the proposed architecture can handle view dependence in a couple of ways:

1) The embedding network is only locally affine, rather than globally affine. For this reason, it can capture effects like distorted reflections, which create "warped" color level sets that are approximately affine locally, but again no longer globally affine. Please see our results for the shiny dataset on our website, containing a couple of scenes with distorted reflections/refractions through liquid.

2) Our affine embedding network actually outputs several re-parameterizations per ray (effectively predicting multiple points per ray) -- for this reason, if the model needs to, it can retain information about ray direction in the embedding space, and allow the second network to predict colors that depend on viewing direction.
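To make point (2) concrete, here is a rough sketch of an embedding network that predicts several affine re-parameterizations per ray and applies them to the ray's own coordinates. It is simplified relative to the actual code, and the shapes and names are illustrative:

```python
import torch
import torch.nn as nn

class LocalAffineEmbedding(nn.Module):
    """Sketch: predict K affine maps (A, b) per ray and apply them to the ray.

    The output is the concatenation of the K re-parameterized coordinates,
    which is then fed to the color network. Illustrative only.
    """
    def __init__(self, ray_dim=4, out_dim=2, k=8, hidden=128):
        super().__init__()
        self.ray_dim, self.out_dim, self.k = ray_dim, out_dim, k
        # Predict K matrices A (out_dim x ray_dim) and K offsets b (out_dim).
        n_params = k * (out_dim * ray_dim + out_dim)
        self.mlp = nn.Sequential(
            nn.Linear(ray_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_params),
        )

    def forward(self, rays):                       # rays: (N, ray_dim)
        params = self.mlp(rays)
        A, b = params.split([self.k * self.out_dim * self.ray_dim,
                             self.k * self.out_dim], dim=-1)
        A = A.view(-1, self.k, self.out_dim, self.ray_dim)
        b = b.view(-1, self.k, self.out_dim)
        # Apply each predicted affine map to the ray that produced it: Ax + b.
        x = rays.unsqueeze(1).unsqueeze(-1)        # (N, 1, ray_dim, 1)
        z = (A @ x).squeeze(-1) + b                # (N, K, out_dim)
        return z.flatten(1)                        # (N, K * out_dim)
```

Because the embedded representation is a set of K re-parameterized points rather than a single one, the second network can still recover some directional information when it needs to model view-dependent effects.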

With regards to how we can drive optimization to find a good embedding (e.g. to "match" different rays that observe the same content in the embedding space) --- we actually never explicitly regularize the embedding to produce good point features / matches. One of the biggest surprises of this work, to us, was that we do not need explicit regularization for the embedding to work (although it certainly might benefit from explicit regularization). The color loss alone (and the finite capacity of the second MLP) seems to be enough to drive the model to match rays that observe the same points.
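Concretely, the objective is nothing more than a photometric reconstruction loss on the predicted ray colors; a simplified training step might look like this (illustrative names, not the repository's actual training loop):

```python
import torch

# model: embedding network + color network chained together, as sketched above.
# rays: (N, 4) training ray coordinates; colors: (N, 3) ground-truth RGB.
def train_step(model, optimizer, rays, colors):
    optimizer.zero_grad()
    pred = model(rays)
    # Only a color reconstruction loss -- no explicit regularizer pushes the
    # embedding to match rays that observe the same 3D point.
    loss = torch.mean((pred - colors) ** 2)
    loss.backward()
    optimizer.step()
    return loss.item()
```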

It is, however, important to either (1) have dense training data (like the Stanford scenes), so that the embedding can be learned effectively, or to (2) reduce the complexity of the scene so that the embedding is easier to model. For this second item, we make use of a subdivided light field architecture, with a set of light fields within voxels in a voxel grid. In this case, the light field, and thus embedding, for each voxel is much simpler to learn than the embedding for a full scene's light field. Another way of thinking about this is that the "search space" for matching rays is much more tractable for light fields that live within small voxels. See section 5 for discussion of our subdivided light field architecture.
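Very loosely, the subdivision step amounts to intersecting each ray with an axis-aligned voxel grid and handing it to the small light field network(s) living in the voxels it crosses. The sketch below only illustrates the ray/voxel localization with a standard slab test; the per-voxel networks and how their outputs are combined are described in Section 5, and the grid layout here is arbitrary:

```python
import numpy as np

def ray_aabb_interval(origin, direction, box_min, box_max):
    """Slab test: return (t_near, t_far) where the ray is inside the box,
    or None if the ray misses it. Illustrative helper only."""
    inv_d = 1.0 / direction                       # assumes no zero components
    t0 = (box_min - origin) * inv_d
    t1 = (box_max - origin) * inv_d
    t_near = np.max(np.minimum(t0, t1))
    t_far = np.min(np.maximum(t0, t1))
    return (t_near, t_far) if t_near <= t_far and t_far > 0 else None

# Example: a 2x2x2 voxel grid over [-1, 1]^3; each voxel would own its own
# small embedding + color network, trained only on the rays that cross it.
origin = np.array([0.1, 0.2, -3.0])
direction = np.array([0.01, -0.02, 1.0])
for idx in np.ndindex(2, 2, 2):
    box_min = -1.0 + np.array(idx, dtype=float)   # voxel lower corner
    box_max = box_min + 1.0                       # voxel upper corner
    hit = ray_aabb_interval(origin, direction, box_min, box_max)
    if hit is not None:
        print(idx, "entry/exit t:", hit)
```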

dlutzzw commented 1 year ago

Thank you for your patient answer; your reply above has cleared up the second question I asked.

The main reason is that the expressive capacity of the later MLP network is limited, so the embedding network is forced to learn the matching relationship, which makes the later MLP easier to train. But this implicit pressure may not always work so well, so we also need dense data or reduced scene complexity to make the matching better (as you said in your answer).

After reading your two replies, I read the paper again, and I still have two questions:

The method of local affine embedding is: the rays emitted from the same 3D point are regarded as an affine subspace (a subspace of the ray space), and the job of the embedding network is to map all rays in that subspace to the same point in the embedding space (that is, to the same embedding feature vector). This process can be seen as a mapping from the ray space to a feature vector space. My first question is: is it not possible to use a feature embedding instead? Why must the embedding take the form AX+B (local affine embedding)?

My second question is:

Regarding the question above (why use a re-parameterization of the form AX+B), I read Appendix B in the paper. But I'm at a loss as to what "interpolation kernels aligned with..." means, and after reading the explanation several times it's still hard for me to understand why the AX+B form is better.

Maybe I'm missing some concepts, or have misunderstood something.

Since I think the ideas in this paper are worth learning from, I want to really understand the motivation behind the design. Sorry to bother you.

breuckelen commented 1 year ago

No worries at all! I appreciate your questions and interest in the work.

Let me address the second question first. The part of the supplement that you referenced shows that the subspace corresponding to a 3D point is affine, and derives an expression for this affine transform in terms of the depth of the point, and the depth of the planes in the original parameterization. This expression shows that the larger the depth difference between the planes in the original parameterization and the depth of the point, the bigger the difference between the original parameterization and the true (x, y) coordinates of the point that the ray intersects. The affine re-parameterization network can learn to "undo" this transform.
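In a slightly informal notation of my own (which may differ from the appendix): a ray with two-plane coordinates (x, y, u, v), where (x, y) lies on the plane at depth z_xy and (u, v) on the plane at depth z_uv, intersects the plane at the point's depth z at

```math
\begin{aligned}
s &= u + (x - u)\,\frac{z - z_{uv}}{z_{xy} - z_{uv}}
   = \frac{z - z_{uv}}{z_{xy} - z_{uv}}\, x
   + \frac{z_{xy} - z}{z_{xy} - z_{uv}}\, u, \\
t &= v + (y - v)\,\frac{z - z_{uv}}{z_{xy} - z_{uv}}
   = \frac{z - z_{uv}}{z_{xy} - z_{uv}}\, y
   + \frac{z_{xy} - z}{z_{xy} - z_{uv}}\, v,
\end{aligned}
```

which is affine in (x, y, u, v), with coefficients determined entirely by the depths involved.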

With regards to the interpolation kernel explanation --- this is meant to justify why an MLP with axis aligned positional encoding needs to match rays to 3D points in order to interpolate well (which I stated without much justification in my earlier answers). An axis-aligned Fourier Feature Network induces interpolation kernels that are axis-aligned on its input space. In this case, without embedding, the kernels of a vanilla light field network will be aligned with (u_hat, v_hat). When the original parameterization is bad, then the difference between (s_hat, t_hat) and (u_hat, v_hat) is large, and these interpolation kernels will not be aligned with (s_hat, t_hat). It's much better to have interpolation kernels that only vary when we change the 3D point that a ray intersects. Otherwise, we might get unwanted changes in the output light field for unobserved views.
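To make "axis-aligned" concrete: the positional encoding acts on each input coordinate independently, so whatever coordinates you feed it determine the axes along which the network interpolates smoothly. A toy version of such an encoding (illustrative only, not our exact frequency schedule or implementation):

```python
import numpy as np

def axis_aligned_fourier_features(coords, n_freqs=6):
    """Encode each coordinate independently with sin/cos at octave frequencies.

    coords: (N, D) array, e.g. raw (x, y, u, v) rays or re-parameterized
    coordinates produced by the embedding network. Illustrative only.
    """
    freqs = 2.0 ** np.arange(n_freqs) * np.pi      # (n_freqs,)
    angles = coords[..., None] * freqs             # (N, D, n_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(coords.shape[0], -1)      # (N, D * 2 * n_freqs)

rays = np.random.uniform(-1, 1, size=(4, 4))       # four toy (x, y, u, v) rays
print(axis_aligned_fourier_features(rays).shape)   # (4, 48)
```

Roughly speaking, because every sin/cos depends on a single raw coordinate, the induced interpolation kernels are tied to those raw axes; re-parameterizing first lets them track the underlying 3D content instead.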

For your first question, the reason that affine embedding is better than feature embedding is again related to our desire to make the embedding signal as simple/smooth as possible. If we consider a scene with smooth, or constant z-depth, then the affine embedding network can actually learn the same affine transform for all rays (even if these rays observe different points!) --- as discussed above, the orientation of the affine subspace only depends on z-depth. However, a feature-embedding will have to learn a different feature for every 3D point. This is especially a problem for scenes with complex spatial texture.

Taking this idea further, it may be possible to design embeddings that are smoother/simpler for a wider variety of scenes. For example, you might be able to predict plane parameters with the embedding network (e.g. plane normal, distance from origin), or other geometric primitives, which could work well for scenes that are approximately planar (but do not necessarily have constant z-depth).

dlutzzw commented 1 year ago

I seem to understand what you mean.

Take the ideal case in Appendix B as an example (a textured square where every 3D surface point has the same depth).

Under this condition, the purpose of the affine transformation is to compute (s, t) from the input (x, y, u, v) ray coordinates, using the known depths Zst and Zuv. If a set of rays is emitted from the same 3D point, then their computed (s, t) are equal. Also, by similar triangles, the affine transformation from (x, y, u, v) to (s, t) is the same for all of them. In the more special case where the entire textured square has the same depth, the affine transformation from (x, y, u, v) to (s, t) is the same for all rays in the light field, again by similar triangles. Right?
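As a quick numerical sanity check of this picture (my own toy script and depth values, not the paper's notation), all rays through one point are indeed sent to the same (s, t) by a single depth-determined affine map:

```python
import numpy as np

z_xy, z_uv, z_st = 0.0, 1.0, 2.5      # parameterization planes and point depth
p = np.array([0.3, -0.2, z_st])       # a 3D point on the constant-depth square

# Two-plane coordinates of rays through p, sampled over slopes (a, b).
slopes = np.random.uniform(-0.5, 0.5, size=(5, 2))
rays = np.stack([p[0] + (z_xy - z_st) * slopes[:, 0],
                 p[1] + (z_xy - z_st) * slopes[:, 1],
                 p[0] + (z_uv - z_st) * slopes[:, 0],
                 p[1] + (z_uv - z_st) * slopes[:, 1]], axis=1)   # (5, 4)

# Depth-determined affine map (similar triangles): (x, y, u, v) -> (s, t).
w = (z_st - z_uv) / (z_xy - z_uv)
A = np.array([[w, 0, 1 - w, 0],
              [0, w, 0, 1 - w]])
st = rays @ A.T                        # the same A for every ray
print(st)                              # every row equals (0.3, -0.2)
```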

Finally, the advantage of using local affine embedding (AX+B) is that for a set of rays emitted from the same 3D point, their affine transformations are the same (by similar triangles), so it is easier for the embedding MLP network to learn.

But for a feature-space embedding, if a set of rays emitted from the same 3D point is supposed to map to the same embedding feature after passing through the embedding network, the network needs to learn a different (unknown) transform for each ray, and this task is more difficult for the embedding MLP network.

I don't know whether my understanding is correct or not.