Lakonik / SSDNeRF

[ICCV 2023] Single-Stage Diffusion NeRF
https://lakonik.github.io/ssdnerf/
MIT License

Train on custom data for indoor reconstruction #27

Closed: XuM007 closed this issue 10 months ago

XuM007 commented 1 year ago

Hi, I would like to know whether it's possible to apply your work to indoor scene reconstruction. I plan to train the code on the ScanNet dataset, which provides images of indoor scenes. After training on many scenes from the dataset, at test time I want to obtain the 3D mesh of an unseen room (not from the dataset) from 4 images taken from its 4 corners. I mainly have two concerns:

  1. Your work targets single objects with no background. Do you think it can be applied to indoor reconstruction?
  2. My ultimate goal is to use the trained model to synthesize a room that I photographed myself from 4 pictures, with very little overlap between viewpoints. Do you think this is feasible? If you have any suggestions, I'd love to hear them before I start trying. Thanks a lot.

Lakonik commented 1 year ago

Hi! While indoor reconstruction should be theoretically feasible within the single-stage framework, you will have to tune quite a lot of hyper-parameters and maybe modify some modules to make it work, since all our models are optimized for single-object scenes with surrounding cameras.

  1. You probably need to add a trainable background. You can refer to the implementation in torch-ngp, since we borrowed their renderer (see the sketch after this list for one possible form).
  2. 4 views with little overlap should be no problem; SSDNeRF is designed for arbitrary views anyway. However, tuning the guidance/finetuning hyper-parameters on a new dataset requires some patience.
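
To illustrate point 1, here is a minimal sketch of a trainable background in the spirit of torch-ngp's direction-conditioned background MLP. All names are illustrative assumptions, not the actual SSDNeRF or torch-ngp API:

```python
import torch.nn as nn

class TrainableBackground(nn.Module):
    """Direction-conditioned background color, roughly in the spirit of
    torch-ngp's background MLP. Illustrative sketch only."""

    def __init__(self, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, rays_d):
        # rays_d: (N, 3) unit view directions -> (N, 3) background RGB
        return self.mlp(rays_d)

# Composite it behind the volume-rendered foreground, where weights_sum
# is each ray's accumulated alpha from the renderer:
#   rgb = rgb_fg + (1.0 - weights_sum.unsqueeze(-1)) * bg(rays_d)
```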

In general, I would recommend starting by reconstructing the training scenes with this stage1 config, without training the diffusion model. Once you find the NeRF decoder good enough, you can start over with single-stage training, where you need to adjust the loss weights to balance the diffusion and rendering losses.
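The key names below are assumptions, not the exact SSDNeRF config schema; the point is only the two-phase weighting pattern being described:

```python
# Hypothetical loss-weight knobs (names are illustrative). Stage 1
# trains the NeRF decoder and scene codes with the rendering loss only;
# single-stage training then adds the diffusion loss, and the ratio
# must be re-tuned on a new dataset.
stage1_loss_weights = dict(
    rendering=1.0,   # photometric loss on rendered pixels
    diffusion=0.0,   # diffusion model not trained yet
)

single_stage_loss_weights = dict(
    rendering=1.0,
    diffusion=0.1,   # starting point only; tune so neither loss dominates
)
```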

PS: ScanNet has only around 1500 scenes. This is relatively small for training a diffusion model, and overfitting could be an issue if you train the model from scratch.

XuM007 commented 1 year ago

Thank you so much for your suggestions. To prevent overfitting, I will add other datasets such as Replica to build a larger indoor dataset, and then experiment as you suggest. I have another question: do you suggest that I modify the input size in your code according to the aspect ratio of the input images? Thank you very much for your time.

Lakonik commented 1 year ago

The size of the code should follow the dimensions of the 3D scenes, not the images. The current square triplane code is most definitely not an efficient interior representation. I would recommend looking into existing interior representations, and you may try integrating some well-established models into the SSDNeRF framework. A rough sketch of the sizing idea is below.
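
This sketch only illustrates sizing a (tri)plane code to a room's bounding box rather than to the image shape; the variable names and resolutions are hypothetical, not SSDNeRF code:

```python
import numpy as np

# Hypothetical room bounding box in meters (x: width, y: depth, z: height).
scene_min = np.array([-4.0, -3.0, 0.0])
scene_max = np.array([ 4.0,  3.0, 2.5])
extent = scene_max - scene_min          # -> [8.0, 6.0, 2.5]

# Allocate plane resolution proportionally to the extent, so a square
# code does not waste capacity on the short vertical axis.
feats_per_meter = 16                    # tune for memory/quality
res = np.maximum(8, np.round(extent * feats_per_meter)).astype(int)  # [128, 96, 40]

# Three rectangular feature planes instead of three equal squares:
plane_xy = (res[0], res[1])  # floor plan
plane_xz = (res[0], res[2])  # front elevation
plane_yz = (res[1], res[2])  # side elevation
```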

XuM007 commented 12 months ago

I conducted preliminary experiments on one NVIDIA A100 80GB PCIe GPU. Setting overfitting aside, I constructed a train/valid/test split of 1500/100/20 rooms, where each scene has 100 views (image size 128×128). Under this setting, one training step takes about 2 minutes, far behind the 0.5 s achieved with the two RTX 3090 GPUs in the paper, which means my experiment cannot finish in a reasonable amount of time. Could you suggest possible causes and remedies so that I can obtain preliminary results in a reasonable time?

Lakonik commented 12 months ago

If your GPU usage is high, then this is probably normal, and you only need to wait until around 1k~2k iterations, when the occupancy-based pruning strategy takes effect and speeds up rendering. If your GPU usage is low, then the bottleneck is likely I/O or the CPU; see the sketch below for a quick check.
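
One quick way to tell the two cases apart is to time pure data loading. This is a generic PyTorch sketch, not code from this repo; `time_batches` and the loader variable are hypothetical:

```python
import time
import itertools

def time_batches(dataloader, n=10):
    """Average pure data-loading time per batch (no forward/backward)."""
    it = iter(dataloader)
    t0 = time.perf_counter()
    for _ in itertools.islice(it, n):
        pass
    return (time.perf_counter() - t0) / n

# If this is a large fraction of your 2-minute step, the GPU is starved:
# try more DataLoader workers, pre-decoded/cached images, or faster
# storage. If it is tiny, the cost is in rendering itself, and the
# occupancy pruning mentioned above should eventually help.
# print(f'{time_batches(train_loader):.2f} s/batch')  # train_loader: your loader
```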

XuM007 commented 11 months ago

After reading the code, I found that the reason for the slow training was that evaluation.interval was not set in the 4-view config file. This caused evaluate_3d() in GenerativeEvalHook3D to run after every iteration, which is very slow. I noticed that in the 1-view file the interval is set to 20k. So I would like to know: in the 4-view config, is it necessary to keep this parameter at 1?

Lakonik commented 11 months ago

The 4-view cfg and the 1-view cfg are equivalent apart from the testing configuration. You can either train everything using the 1-view cfg, or set a desired interval in the 4-view cfg for training, as in the sketch below.
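
For example (the dict structure follows mmgen-style configs and is an assumption; check the actual 1-view file for the exact fields):

```python
# In the 4-view config:
evaluation = dict(
    type='GenerativeEvalHook3D',
    interval=20000,  # run evaluate_3d() every 20k iterations, not every one
    # ... keep the remaining evaluation fields from the original config
)
```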