facebookresearch / neuralvolumes

Training and Evaluation Code for Neural Volumes

Questions: Does your work generalize to unseen training data? #5

Closed phongnhhn92 closed 3 years ago

phongnhhn92 commented 4 years ago

Hello, thanks for uploading the code. I would like to ask whether your method can generalize to data not seen during training.

stephenlombardi commented 4 years ago

It depends on what you mean by unseen training data. There are multiple dimensions along which one might expect the method to generalize.

Does it generalize to different viewpoints? The answer to this is yes. We can train the model from a discrete set of viewpoints (around 30 for most examples in the paper) and expect it to interpolate between viewpoints well. This is shown in the supplemental video.
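
To make "interpolate between viewpoints" concrete, here is a minimal sketch of generating in-between camera poses to render from. The look-at convention and the straight-line camera path are simplifications for illustration, not our actual camera model:

```python
import torch

def normalize(v):
    return v / v.norm()

def look_at(eye, target, up):
    """Camera-to-world rotation looking from `eye` toward `target`.
    Columns are right/up/-forward (an OpenGL-style convention; the
    repo may use a different one)."""
    forward = normalize(target - eye)
    right = normalize(torch.linalg.cross(forward, up))
    true_up = torch.linalg.cross(right, forward)
    return torch.stack([right, true_up, -forward], dim=1)

def intermediate_poses(eye_a, eye_b, steps=10,
                       target=torch.zeros(3),
                       up=torch.tensor([0.0, 1.0, 0.0])):
    """Camera poses along a path between two training viewpoints;
    rendering the learned volume from each pose gives the
    interpolated views."""
    poses = []
    for t in torch.linspace(0.0, 1.0, steps):
        eye = (1.0 - t) * eye_a + t * eye_b  # lerp the camera position
        poses.append((look_at(eye, target, up), eye))
    return poses
```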

Does it generalize to different animations, similar to the ones seen during training? Yes, depending on how much data the model is trained on. If we train the model on a large sequence of facial expressions, for example, we expect it to be able to interpolate between expressions seen during training (by interpolating the latent space), though the quality of this varies somewhat.
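
As a concrete sketch of what "interpolating the latent space" means here (the `encoder`/`decoder` names and signatures are placeholders, not our actual API):

```python
import torch

def interpolate_expressions(encoder, decoder, frames_a, frames_b, steps=8):
    """Decode volumes along a straight line between two latent codes.
    `encoder`/`decoder` stand in for the trained Neural Volumes modules."""
    with torch.no_grad():
        z_a = encoder(frames_a)  # latent code for expression A
        z_b = encoder(frames_b)  # latent code for expression B
        volumes = []
        for t in torch.linspace(0.0, 1.0, steps):
            z = (1.0 - t) * z_a + t * z_b  # linear blend in latent space
            volumes.append(decoder(z))     # volume for the blended expression
    return volumes
```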

If you're asking whether the model can produce good results for an object different from the one seen during training, the answer is no. In this paper, the model is trained on a per-object/scene basis, so it's not expected that the model will do anything sensible for objects it hasn't seen.

phongnhhn92 commented 4 years ago

@stephenlombardi Thanks a lot for your detailed answer!

I agree that NV generalizes quite well to different viewpoints, as I have seen in the supplementary video.

My concern is about the third point you made above: it seems NV can't generalize to an object different from the one seen during training. I am curious whether you have ever tried training NV on multiple instances of objects or scenes. To my understanding, your method has two main modules: an encoder-decoder network and a non-learned differentiable ray-marcher that renders the image. Is there any reason you haven't tried training the network on a dataset containing multiple sets of different objects (captured in the same environment and against the same background, for example)? Unlike recent papers such as NeRF or SRN, whose networks take only xyz coordinates as input, your method uses a neural network to encode and decode the scene representation, so I think it has the potential to generalize across multiple instances of objects and scenes. What is your opinion?
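
For concreteness, this is roughly how I picture those two modules fitting together. A minimal sketch only: the module names, tensor shapes, and the `sample_trilinear` helper are my own placeholders, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

def sample_trilinear(vol, pts):
    """Trilinearly sample a voxel grid at 3D points in [-1, 1] coords.
    vol: (1, C, D, H, W); pts: (R, 3). Returns (R, C)."""
    grid = pts.view(1, -1, 1, 1, 3)
    out = F.grid_sample(vol, grid, align_corners=True)  # (1, C, R, 1, 1)
    return out.view(vol.shape[1], -1).t()

def render(encoder, decoder, images, ray_origins, ray_dirs,
           n_samples=128, step=0.02):
    """Encode -> decode a volume -> accumulate along rays.
    `encoder`/`decoder` are placeholders for the learned modules;
    only the raymarching loop below is the non-learned part."""
    z = encoder(images)              # latent code for this object/scene
    rgb_vol, alpha_vol = decoder(z)  # (1, 3, D, H, W) and (1, 1, D, H, W)

    color = torch.zeros_like(ray_origins)           # (R, 3)
    transmittance = torch.ones_like(ray_origins[:, :1])  # (R, 1)
    for i in range(n_samples):
        pts = ray_origins + (i * step) * ray_dirs   # points along each ray
        rgb = sample_trilinear(rgb_vol, pts)                  # (R, 3)
        alpha = sample_trilinear(alpha_vol, pts).clamp(0, 1)  # (R, 1)
        # Front-to-back compositing; every op is differentiable, so
        # gradients flow from the pixels back to the encoder/decoder.
        color = color + transmittance * alpha * rgb
        transmittance = transmittance * (1.0 - alpha)
    return color
```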

stephenlombardi commented 4 years ago

We have tried some experiments where we trained a neural volumes model across multiple different objects (in our case, across different identities). In these cases we were mostly interested in how well the latent space smoothly captures variation across identities. Note that across different identities the overall form of the object is similar, which may make it easier for the decoder to model.

What we haven't done is feed new images into the encoder to see if they generate good results. I think it's possible, but the system wasn't really designed for this and we haven't evaluated it. I suspect it would work to some degree, but you would probably have to spend some time trying different architectures to get it to work well. One thing that comes to mind is that a "global" latent code is used to represent the object. If you are trying to get it to work well across many object categories, you'd probably want more of a "local" latent code, where different parts of the latent code correspond to different regions of space, and an encoder that takes that into consideration. You would also probably want an encoder that takes into account the viewpoint of the camera that took the image (DeepVoxels does this in an interesting way). But this is just my guess.
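
To illustrate the global-vs-local distinction, here is a purely hypothetical sketch; neither class is from this codebase:

```python
import torch
import torch.nn as nn

class GlobalCodeDecoder(nn.Module):
    """One latent vector describes the whole object (the Neural Volumes
    setup): a single linear map from z to an entire voxel grid."""
    def __init__(self, z_dim=256, grid=32, channels=4):
        super().__init__()
        self.grid, self.channels = grid, channels
        self.net = nn.Linear(z_dim, channels * grid ** 3)

    def forward(self, z):  # z: (N, z_dim)
        g = self.grid
        return self.net(z).view(-1, self.channels, g, g, g)

class LocalCodeDecoder(nn.Module):
    """Sketch of the "local" alternative: a coarse grid of codes, each
    decoded into its own patch of the volume, so parts of the code
    correspond to regions of space."""
    def __init__(self, z_dim=32, up=4, channels=4):
        super().__init__()
        self.up, self.channels = up, channels
        self.net = nn.Conv3d(z_dim, channels * up ** 3, kernel_size=1)

    def forward(self, z_grid):  # z_grid: (N, z_dim, D, H, W)
        N, _, D, H, W = z_grid.shape
        u, C = self.up, self.channels
        out = self.net(z_grid).view(N, C, u, u, u, D, H, W)
        # Interleave each code's u^3 outputs into a grid u times finer,
        # so each coarse code controls one local region of the volume.
        out = out.permute(0, 1, 5, 2, 6, 3, 7, 4)
        return out.reshape(N, C, D * u, H * u, W * u)
```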