bmild / nerf

Code release for NeRF (Neural Radiance Fields)
http://tancik.com/nerf
MIT License

network acceleration and generalization #54

Open mwsunshine opened 4 years ago

mwsunshine commented 4 years ago

hi,

Thank you for your impressive work, first of all! I am a big follower of your research on LLFF as well, and I am more than glad to witness the huge improvement you have made!

Meanwhile, since I made many comparisons between the two works while reading this one, here are some questions that I am a little confused about and would appreciate your explanations for:

  1. What is the advantage of choosing a fully-connected network rather than a deep convolutional network? As far as I know, the network in NeRF needs to be retrained whenever the input data changes, while in LLFF the network is trained once and for all: once trained, it can automatically output a defined number of depth planes.

  2. Have you considered speeding up the training process? To get the first output video (50000 iterations by default), it takes about 2.5~3 hours on a 2080 Ti. Given the time needed for training, the NeRF network could hardly be embedded into a smartphone. I am wondering whether it could be accelerated? For example, categorize use cases under labels such as plants, people, buildings, etc., pre-train a base model for each case, and then fine-tune that base model on the inputs (just an example; it may not make sense).

  3. Have you tried applying ideas from NeRF to LLFF to improve its performance? If yes, how were the results; if no, what are the difficulties? Ideas like positional encoding, or using (x, y, z, theta, phi) as inputs, seem very likely to be useful in LLFF.

  4. Precision of COLMAP: I notice that both networks use COLMAP to get the initial poses for the images. What if the poses are slightly different from the ground truth (I have run into such cases several times)?

Thanks again for sharing your excellent work!

kwea123 commented 4 years ago

I'd like to offer my opinions on some of your questions. I've exchanged many ideas with the author in the issues, so I think I understand his approach to some degree.

  1. LLFF and NeRF are two totally different approaches. LLFF is trained on data from many scenes so that it can predict the MPI representation of an unseen scene; that is also why it uses a CNN to extract image features and get better matching results. NeRF, however, is based on rays, so there is no direct need for a CNN, since rays are not necessarily spatially correlated (exploiting the spatial correlation of adjacent rays could make for good research, though). As a result, an MLP is chosen for its simplicity (see the MLP sketch at the end of this comment).

  2. The training is slow here simply because he hasn't optimized the code much. Other people and I have implemented faster (1.3x~1.5x) versions in PyTorch; you can refer to #15.

    the NeRF network could hardly be embedded into a smartphone.

    By that, do you mean the inference time? Currently it is very slow because it generates detailed RGBA values along every ray. There is no direct way to accelerate it based on the NeRF code (there is a similar concept in this paper that accelerates ray marching; maybe that would help). However, it is possible if you compromise on quality: you can generate a fixed low-resolution volume of RGBA values, then render it with a traditional volume rendering technique (see the baking sketch at the end of this comment). That can be a lot faster (>100 FPS) with still promising visual quality; I have some examples in my repo. Finally, I think NeRF needs to be trained from scratch for every scene, so there is no notion of a "base model" or "fine-tuning".

  3. I was also worried about image distortion, see #35. Currently NeRF assumes perfect camera calibration, with fx = fy = f and the principal point at the image center (cx = W/2, cy = H/2), without distortion (see the ray-generation sketch at the end of this comment). Obviously this is not true for real cameras, but the training results are still good, so I believe it can tolerate small errors in the intrinsics, and so perhaps in the extrinsics as well.
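
To make point 1 concrete, here is a minimal PyTorch sketch of the kind of network NeRF uses. The layer count is cut down and the separate view-direction branch is omitted, so take it as an illustration rather than the released architecture:

```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs=10):
    # Map each coordinate to [sin(2^k x), cos(2^k x)] for k = 0..n_freqs-1:
    # the positional encoding that lets the MLP represent high frequencies.
    out = [x]
    for k in range(n_freqs):
        out += [torch.sin(2.0**k * x), torch.cos(2.0**k * x)]
    return torch.cat(out, dim=-1)

class TinyNeRF(nn.Module):
    # Deliberately simplified: the real model is deeper (8 layers, a skip
    # connection) and feeds the view direction into a separate branch.
    def __init__(self, n_freqs=10, hidden=256):
        super().__init__()
        in_ch = 3 + 3 * 2 * n_freqs  # xyz plus its sin/cos encodings
        self.mlp = nn.Sequential(
            nn.Linear(in_ch, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (r, g, b, sigma) for one 3D sample
        )

    def forward(self, xyz):
        return self.mlp(positional_encoding(xyz))
```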
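
For point 2, the bake-then-render idea could look like the sketch below (a hypothetical pipeline, not code from this repo): evaluate the trained model once on a fixed grid, cache the result, and afterwards composite rays against the cached volume with standard alpha blending:

```python
import torch

@torch.no_grad()
def bake_rgba_volume(model, N=128, bound=1.5):
    # Query the trained MLP once on a fixed N^3 grid and cache (rgb, sigma);
    # after this step, rendering never has to touch the network again.
    t = torch.linspace(-bound, bound, N)
    xyz = torch.stack(torch.meshgrid(t, t, t, indexing='ij'), dim=-1)  # (N, N, N, 3)
    return model(xyz.reshape(-1, 3)).reshape(N, N, N, 4)

def composite_ray(rgba_samples, dists):
    # Classic emission-absorption compositing of the samples taken along one
    # ray through the cached volume (the trilinear grid lookup is not shown).
    sigma = torch.relu(rgba_samples[:, 3])
    alpha = 1.0 - torch.exp(-sigma * dists)  # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]  # transmittance
    weights = alpha * trans
    return (weights[:, None] * torch.sigmoid(rgba_samples[:, :3])).sum(dim=0)
```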
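
For point 3, the assumption shows up directly in how rays are generated. Below is a NumPy sketch along the lines of the repo's ray-generation helper: a single focal length serves both axes, and the principal point is pinned to the image center:

```python
import numpy as np

def get_rays(H, W, focal, c2w):
    # fx = fy = focal and (cx, cy) = (W/2, H/2): the ideal pinhole camera,
    # with no distortion model anywhere.
    i, j = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32), indexing='xy')
    dirs = np.stack([(i - 0.5 * W) / focal,
                     -(j - 0.5 * H) / focal,
                     -np.ones_like(i)], axis=-1)
    rays_d = dirs @ c2w[:3, :3].T                        # rotate directions into world space
    rays_o = np.broadcast_to(c2w[:3, -1], rays_d.shape)  # every ray starts at the camera origin
    return rays_o, rays_d
```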

mwsunshine commented 4 years ago

hi, kwea123,

Thanks for your quick response, detailed explanation, and related work (the PyTorch version)! Some of your replies are clear, while for others I might need further explanation:

My first two questions actually stem from the same goal: after capturing several images, how can I obtain the newly synthesized views with just the phone, in an acceptable amount of time, like what LLFF does? To achieve this, I can think of two methods:

A. Train the model offline but run inference on a smartphone online (what LLFF does). That is why I asked the first question: why give up this method? I cannot understand your comment "Finally I think NeRF needs to be trained from scratch for every scene" and might need further explanation here. From my understanding, models can share basic features across similar tasks (that is why transfer learning works, and why the same VGG net can be used to extract higher-level features for almost every scene), so it should not be necessary to start from scratch every time. Meanwhile, rays are correlated if they come from the same object.

B. If method A is not feasible for some reason, speed up the training process to an acceptable amount of time and then run inference. That is why I asked the second question. So "the NeRF network could hardly be embedded into a smartphone" does not refer to inference alone, but to the whole process (specifically, from the input images to the final new views). Even using the PyTorch version might not be acceptable.

As for the perfect-camera assumption: the main camera we use on a cellphone is in the general case a telephoto camera, which is less distorted than wide or ultra-wide cameras, so the distortion can be neglected.

kwea123 commented 4 years ago

A.

models can share basic features across similar tasks

No, NeRF is not like mainstream feature matching, which generalizes; it is more like "overfitting" to a specific object. It trains on many images of the same scene so that it knows how the scene is composed in 3D, but only this scene! For example, its inference only does: "given only xyz in 3D (inference requires no image, nothing else!), infer the occupancy score". There is no reason this can generalize to other objects...
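
To make this concrete, here is a minimal sketch (hypothetical names, compositing omitted) of everything that inference consumes:

```python
import torch

def render_novel_view(model, rays_o, rays_d, n_samples=64, near=2.0, far=6.0):
    # Inputs: rays for the *new* viewpoint plus the scene-specific trained
    # weights in `model` -- no input image appears anywhere at inference time.
    t = torch.linspace(near, far, n_samples)
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t[None, :, None]  # (R, S, 3)
    rgb_sigma = model(pts.reshape(-1, 3)).reshape(-1, n_samples, 4)
    # ...then composite rgb_sigma along each ray exactly as during training.
    return rgb_sigma
```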

I can only think of one case where training from scratch is not required: when the trained objects are very similar to your object. Like the paper above and many others, they train on many car models and also run inference on car models; in this case acceleration might be possible, but in my opinion that is too restrictive (it only works for objects with similar shapes and similar 3D positions) and is not the target of NeRF.

Yes, rays are correlated, but NeRF just shows that it also works without exploiting that correlation. Adding the correlation might work better/faster and is left for future research.

B. Training the whole pipeline and then running inference in real time is impossible, as I described above; this method is just not suited for that...

mwsunshine commented 4 years ago

Many thanks to @kwea123 for what you have shared! That is valuable and informative!

For the questions we haven't discussed, I wonder whether the author could give explanations; for those we have discussed, I wonder whether the author could give confirmation.

akshayxml commented 3 years ago

Nice discussion @mwsunshine @kwea123. Since it has been more than a year since this discussion happened, has there been any paper that tried to improve NeRF in the way @mwsunshine suggested, i.e., one that needs to be trained only once and can then be used on any scene without re-training?

mwsunshine commented 3 years ago

I have not been very focused on this area since then. Novel view synthesis is still a hot topic, and you can look through the recent papers. Most of them are trained once and for all; NeRF is quite unique in this regard.

kwea123 commented 3 years ago

There are a lot now. Take a look: https://github.com/yenchenlin/awesome-NeRF#generalization