Great question! I did test NeRF++ on forward-facing scenes; it worked just as well as, if not slightly better than, NeRF's NDC mapping in my experiments. Coming back to your experiment, I suspect the reason why NeRF++ puts everything in the fg is that your unit sphere already includes a significant portion of the background, which leads to resolution issues. Maybe you can double-check this?
On the other hand, if you would like to end up in a situation similar to NeRF's NDC mapping, I would suggest that you put the scene origin at the midpoint of your 0th and Nth cameras, and normalize the distance between the 0th and Nth cameras to 1.2 or 1.1.
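For concreteness, a minimal sketch of that normalization could look like this (the function name is mine, not from the repo, and I'm assuming camera-to-world matrices stacked in an (N, 4, 4) numpy array):

```python
# Minimal sketch of the suggested normalization (illustrative, not the repo's code),
# assuming `poses` is an (N, 4, 4) array of camera-to-world matrices.
import numpy as np

def normalize_cameras(poses, target_dist=1.1):
    centers = poses[:, :3, 3]                   # camera centers in world coordinates
    origin = 0.5 * (centers[0] + centers[-1])   # midpoint of the 0th and Nth cameras
    scale = target_dist / np.linalg.norm(centers[-1] - centers[0])

    normalized = poses.copy()
    normalized[:, :3, 3] = (centers - origin) * scale  # recenter, then rescale
    return normalized
```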
In fact, from a more theoretical viewpoint, NDC mapping is a special case of the inverted sphere parametrization we use in NeRF++. The way to link the two parametrizations is to think of it this way: in NDC mapping, cameras and scene are separated by a plane, while in the inverted sphere parametrization, cameras and scene are separated by a spherical surface. Hence, our inverted sphere parametrization should have no problem handling the case targeted by NDC mapping, provided you place the unit sphere properly in the scene, just like you need to set the near plane properly when you use NDC mapping. Does this make sense?
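To make the link explicit, here is the inverted sphere parametrization of a background point as a toy standalone sketch (not the repo's implementation):

```python
# A background point p with r = ||p|| > 1 is represented as (x/r, y/r, z/r, 1/r),
# so all four coordinates stay bounded and 1/r -> 0 as the point goes to infinity.
import numpy as np

def invert_background_point(p):
    r = np.linalg.norm(p)
    assert r > 1.0, "background points lie outside the unit (fg) sphere"
    return np.concatenate([p / r, [1.0 / r]])
```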
> your unit sphere already includes a significant portion of the background
Yes, I have doubts about this as well, but I'm not sure to what extent it affects training, because there is still a large portion (like the sky) that I'm sure lies infinitely far away and should be treated as bg.
> NDC mapping is a special case of the inverted sphere parametrization
I believe this is true, but in my opinion tuning the sphere is less trivial, because even for a human it is difficult to determine where the fg/bg boundary lies... and if it's not set correctly, the performance gets worse. In NDC, by contrast, we only need to set a plane.
> I would suggest that you put the scene origin at the midpoint of your 0th and Nth cameras, and normalize the distance between the 0th and Nth cameras to 1.2 or 1.1
To expand on my statement above, I experimented with different fg/bg boundary parameters around your suggested values.
This value controls the spread of the camera poses within the unit sphere. It turns out that if this value is large (i.e. the first and last cameras are close to the sphere), everything is learnt as bg; if it is small, everything is learnt as fg; and somewhere in the middle we get a fg/bg separation like Fig. 7 in the paper. This behaviour is expected, of course, but the problem is how to determine a good value beforehand. There is no reason for it to be the same across different scenes, and it doesn't seem that it can be easily determined just by observing the images. Another thing I found is that the fg/bg split seems totally random: with the exact same hyperparameters, sometimes it learns everything as fg, sometimes everything as bg, and sometimes fg with bg... Are your results always consistent? That is, with the exact same hyperparameters, does the model always learn roughly the same fg/bg split?
Some quick comments:

1) Putting everything in the bg lands you exactly in the situation of NDC mapping, because the sphere surface then functions similarly to NDC's camera-scene separation plane. In fact, you can always choose the sphere surface to roughly align with NDC's near plane to make NeRF++ behave like NeRF for this kind of forward-facing capture (in this case, you first determine the sphere surface, then determine the sphere center accordingly; intuitively, you can view NDC's view frustum as approximately a cone of the inverted sphere parametrization).

2) Determining the sphere from the images alone is not a recommended practice in general. The 3D sparse point cloud output by an SfM system is a big help; in fact, the near plane of NDC is also set with the aid of the 3D sparse point cloud. See the sketch after this list.

3) I did not pay much attention to the random split between fg and bg, because at the end of the day, image quality is the top priority in novel view synthesis. Fully understanding the random split might involve a deeper question that hasn't been addressed in the literature: is the geometry learnt by NeRF++/NeRF correct in a quantitative sense, or does it just "look" correct? For the task of novel view synthesis, it's perfectly okay to tolerate somewhat "incorrect" geometry as long as the synthesized images are good, but the "incorrect" geometry will very likely cause the random fg/bg splits you observe here. The shape-radiance ambiguity presented in our arXiv preprint sheds some light on this point, hopefully.
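To make point 2) concrete, one simple heuristic (my own sketch here, not what the repo ships) is to fit the sphere to the cameras and the sparse points, e.g.:

```python
# Rough heuristic for choosing the unit sphere from an SfM sparse point cloud,
# e.g. COLMAP's points3D exported as an (M, 3) array, plus the (N, 3) camera centers.
import numpy as np

def sphere_from_sfm(points3d, cam_centers, fg_percentile=90):
    center = cam_centers.mean(axis=0)              # place the sphere around the cameras
    dists = np.linalg.norm(points3d - center, axis=1)
    radius = np.percentile(dists, fg_percentile)   # enclose most of the reconstructed fg points
    # Translate by -center and scale by 1/radius so this sphere becomes the unit sphere.
    return center, 1.0 / radius
```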
FYI, I just added an interactive viewer for visually inspecting the normalized cameras. Check the readme. Hope it helps in determining a proper normalization.
So do you also observe this random fg/bg split in your experiments on the NeRF forward-facing scenes (not the 360° scenes, where I think the fg/bg split should be roughly correct)? I agree that if everything is in the bg, then there's not much difference from NDC, and if the images look good there is no problem.
I am currently investigating other possible extensions, such as incorporating optical flow/rigid flow, normal estimation, and object insertion using ray tracing. These extensions are easy to integrate if everything operates in real space, but much more complicated in a warped space like NDC or the inverted sphere. That's why I turned to your great work to see if I could make the network learn in real space, but it turns out that it doesn't always produce a good fg, so it would be difficult to directly apply your method in my research.
Does this method only work for 360° unbounded scenes? Does it work on, for example, the forward-facing scenes in NeRF? Has anyone tested this? I tried applying it to a driving scene, where the images are photos taken from a forward-moving car. I defined the sphere center as the last camera position and the radius as 8 times the distance travelled (as for the T&T dataset); the poses look like the image below.
When I use NeRF, it works well with the NDC setting, since everything lies inside the frustum in front of camera 0. However, with NeRF++ it fails to distinguish the foreground (fg) from the background (bg): when I check the training output, it learns everything as fg and the bg is all black. And since the faraway scenery is bg, it learns it very badly. I therefore wonder whether it only works for 360° unbounded scenes, where the fg/bg is easier to distinguish.
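For reference, the normalization I described above is roughly the following (a minimal sketch with illustrative names, assuming camera-to-world matrices stacked in an (N, 4, 4) numpy array):

```python
# Sketch of my normalization: sphere center at the last camera position,
# radius = 8 x the distance travelled (following what I did for the T&T setting).
import numpy as np

def normalize_driving_poses(poses):
    centers = poses[:, :3, 3]
    sphere_center = centers[-1]                                         # last camera position
    travelled = np.linalg.norm(np.diff(centers, axis=0), axis=1).sum()  # path length of the trajectory
    radius = 8.0 * travelled

    normalized = poses.copy()
    normalized[:, :3, 3] = (centers - sphere_center) / radius           # map that sphere to the unit sphere
    return normalized
```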