autonomousvision / giraffe

This repository contains the code for the CVPR 2021 paper "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields"
https://m-niemeyer.github.io/project-pages/giraffe/index.html
MIT License

How Does the Model Achieve the Ability to Condition on Controllable Params (shape, appearance, etc.)? #22

Closed ardianumam closed 3 years ago

ardianumam commented 3 years ago

Hi,

Thanks for the awesome work! I'm curious: how does the trained model acquire the ability to be conditioned on controllable parameters? My questions break down as follows:

  1. Shape and appearance latent codes: is the Discriminator also conditioned on the shape and appearance aspects? I cannot find this in the code. If it is indeed not conditioned, how does the model end up associating the shape latent code with control over shape in the generated data? Likewise for the appearance latent code.
  2. When sampling the transformation (s, R, T) and the camera pose per batch, does the corresponding real_data also have similar properties (s, R, T and camera pose)? If not, again, how can the model associate these controllable variables correctly? As an example of an "unwanted case": we sample T so that the generated object is on the left, but the corresponding real_data used when training the Discriminator has the object on the right.

Many thanks.

m-niemeyer commented 3 years ago

Hi @ardianumam , thanks for your interest in our project!

  1. No, the discriminator is not conditioned on the codes. We observe that injecting the two latent codes at different parts of the network leads to shape and appearance disentanglement in an unsupervised fashion; there is no explicit supervision on this! (A rough sketch of this injection pattern is given after this list.)
  2. Correct, the real data has to have similar properties w.r.t. (s, R, T). However, the nice thing about GAN training is that you compare distributions: hence, you do not need paired data of {image} and {corresponding s, R, T}; we only pre-define a prior over {s, R, T} which roughly matches the data distribution, and can then train without any paired information / direct supervision. (See the second sketch below.)
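For item 1, here is a minimal, hypothetical sketch of the injection pattern described above. The class and argument names are illustrative and not taken from this repository's code; the idea is simply that the shape code enters the trunk that predicts density, while the appearance code only enters the final feature head.

```python
import torch
import torch.nn as nn

class TinyFeatureField(nn.Module):
    """Illustrative sketch (not the repo's exact decoder): the shape code
    conditions the trunk that produces density, while the appearance code
    is only injected before the feature/color head."""
    def __init__(self, dim_pt=3, dim_shape=64, dim_app=64, hidden=128, dim_out=128):
        super().__init__()
        # Trunk sees point encoding + shape code -> controls geometry/density.
        self.trunk = nn.Sequential(
            nn.Linear(dim_pt + dim_shape, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)
        # Feature head additionally sees the appearance code -> controls appearance.
        self.feat_head = nn.Sequential(
            nn.Linear(hidden + dim_app, hidden), nn.ReLU(),
            nn.Linear(hidden, dim_out),
        )

    def forward(self, pts, z_shape, z_app):
        # In practice the same latent code is broadcast to all points of an object.
        h = self.trunk(torch.cat([pts, z_shape], dim=-1))
        sigma = self.sigma_head(h)                              # depends on z_shape only
        feat = self.feat_head(torch.cat([h, z_app], dim=-1))    # also depends on z_app
        return sigma, feat

# Quick shape check with illustrative dimensions:
pts = torch.randn(1, 1024, 3)
z_shape = torch.randn(1, 1024, 64)
z_app = torch.randn(1, 1024, 64)
sigma, feat = TinyFeatureField()(pts, z_shape, z_app)
```

Because only the density/trunk path ever sees z_shape and only the feature head sees z_app, changing one code while fixing the other tends to change geometry and appearance separately, which is the disentanglement observed during training.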
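For item 2, a minimal sketch of what such a pre-defined prior over (s, R, T) could look like. The function name and the ranges here are made up for illustration; the actual per-dataset ranges live in the repository's configs. The key point is that the pose is sampled independently of the real batch, so no pairing is required.

```python
import math
import torch

def sample_pose_prior(batch_size,
                      scale_range=(0.8, 1.2),
                      trans_range=(-0.2, 0.2),
                      rot_range=(0.0, 2 * math.pi)):
    """Illustrative prior over (s, R, T): ranges are placeholders chosen so the
    rendered distribution roughly matches the real image distribution."""
    s = torch.empty(batch_size, 3).uniform_(*scale_range)   # per-axis scale
    t = torch.empty(batch_size, 3).uniform_(*trans_range)   # translation
    theta = torch.empty(batch_size).uniform_(*rot_range)    # rotation about the up-axis
    cos, sin = torch.cos(theta), torch.sin(theta)
    R = torch.zeros(batch_size, 3, 3)
    R[:, 0, 0], R[:, 0, 1] = cos, -sin
    R[:, 1, 0], R[:, 1, 1] = sin, cos
    R[:, 2, 2] = 1.0
    return s, R, t

# GAN training then only compares distributions:
# fake images rendered with sample_pose_prior(B) vs. an unpaired batch of real images.
s, R, t = sample_pose_prior(8)
```

If the prior is badly mismatched with the data (e.g. objects only ever appear on the right in real images but T places them anywhere), the discriminator will push the generator toward the real distribution, so the prior should roughly match the dataset statistics.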