Hello,

Thanks so much for sharing the code for your amazing work. I had a few doubts regarding the disentanglement part of the work:
Is the object-background disentanglement explicit (i.e., using background/foreground masks to train one part of the generator on background pixels only and the remaining parts on foreground pixels), or does the model learn it implicitly? I saw that the paper mentions that the scale and translation of the background are fixed so that it spans the entire scene and is centered at the origin. But does the model learn, in an unsupervised way, that the background feature field generator should take this configuration, or is there some explicit supervision as well? The paper seems to suggest it is unsupervised, but I just wanted to confirm.
I saw that you have N+1 generators for N objects (plus 1 for the background). Are all N object-generator MLPs essentially the same generator with shared weights, or are they different? If all the objects are, say, cars, then one generator would probably be enough to generate all of them; but if the scene contains different object classes, such as a car, a bicycle, and a pedestrian, then a per-category object generator would probably make more sense?

Thanks again!
Our model learns the disentanglement in an unsupervised way - we do not use any form of supervision. The model is trained only with the GAN objective, so the only input it receives is real images: no masks, annotations, etc.
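To make the "no explicit supervision" point concrete, here is a minimal sketch (in PyTorch, with made-up function names and value ranges, not the repository's actual API) of the only special treatment the background gets: its pose is hard-coded instead of sampled, while object poses are drawn from a prior. The separation into objects and background then has to emerge purely from the adversarial loss on unlabeled real images.

```python
import torch

def sample_object_pose(batch_size):
    # Hypothetical ranges for illustration; the actual code samples from
    # dataset-specific pose priors.
    scale = torch.empty(batch_size, 3).uniform_(0.2, 0.5)
    translation = torch.empty(batch_size, 3).uniform_(-0.5, 0.5)
    return scale, translation

def background_pose(batch_size):
    # Fixed, never sampled and never learned: the background feature field
    # spans the whole scene and is centered at the origin.
    scale = torch.ones(batch_size, 3)
    translation = torch.zeros(batch_size, 3)
    return scale, translation
```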
Correct, the object MLPs are shared. Your argument is also correct: we only consider the case where all objects are of the same class, e.g. multiple cars, multiple primitives, etc. If you want to model different object classes, I agree that per-category MLPs would make more sense.
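For illustration, here is a rough sketch of the N+1 layout under the assumptions above: a single weight-shared MLP evaluated once per object, plus a separate MLP for the background (module names, layer sizes, and latent dimensions are hypothetical, not taken from the repo).

```python
import torch
import torch.nn as nn

class FeatureFieldMLP(nn.Module):
    """Maps 3D points plus a latent code to features and a density value."""
    def __init__(self, latent_dim=64, hidden=128, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim + 1),  # features + density
        )

    def forward(self, points, z):
        # points: (B, P, 3), z: (B, latent_dim)
        z = z.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.net(torch.cat([points, z], dim=-1))

object_mlp = FeatureFieldMLP()      # one set of weights, reused for all N objects
background_mlp = FeatureFieldMLP()  # separate weights for the background

B, P, N, latent_dim = 2, 1024, 3, 64
points = torch.rand(B, P, 3)
object_fields = [object_mlp(points, torch.randn(B, latent_dim)) for _ in range(N)]
background_field = background_mlp(points, torch.randn(B, latent_dim))
```

For scenes with several classes, one could instead keep a small dictionary of per-category MLPs, e.g. `{"car": FeatureFieldMLP(), "bicycle": FeatureFieldMLP()}`, and select the MLP according to the sampled category of each object.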
I hope this helps a little. Good luck with your research!