Closed XuM007 closed 10 months ago
Hi! While indoor reconstruction should be theoretically feasible within the single-stage framework, you will have to tune quite a lot of hyper-parameters and maybe modify some modules to make it work, since all our models are optimized for single-object scenes with surrounding cameras.
In general, I would recommend starting from reconstructing the training scenes based on this stage1 config, without training the diffusion model. Once you find the NeRF-decoder good enough, you can start over with single-stage training, where you need to adjust the loss weights to balance the diffusion and rendering loss.
PS: Scannet has only over 1500 scenes. This is relatively small for training a diffusion model, and overfitting could be an issue if you train the model from scratch.
Thank you so much for your suggestion. To prevent overfitting, I will look for other datasets such as Replica to construct a larger indoor dataset, and then experiment as you suggested. I have another question: do you suggest that I modify the input size in your code according to the aspect ratio of the input images? Thank you very much for your time.
The size of the code should match the dimensions of the 3D scenes, not the images. The current square triplane code is most definitely not an efficient interior representation. I would recommend referring to some existing interior representations, and you may try integrating some well-established models into the SSDNeRF framework.
I conducted preliminary experiments on one NVIDIA A100 80GB PCIe GPU. Without considering overfitting, I constructed a train/valid/test split of 1500/100/20 rooms, with 100 views per scene (image shape 128×128). Under this setting, one training step takes about 2 minutes to complete, which is far behind the 0.5 sec achieved by the two RTX 3090 GPUs in the paper. This also means that my experiment cannot be completed in a reasonable amount of time. In order to proceed with the experiment, I would like to ask about possible causes and any suggestions so that I can obtain preliminary results in a reasonable time.
If your GPU usage is high, then this is probably normal, and you only need to wait until around 1k~2k iterations before the occupancy-based pruning strategy takes effect, which speeds up rendering. If your GPU usage is low, then the bottleneck could be I/O or the CPU.
After reading the code, I found that the cause of the slow training was that `evaluation.interval` was not set in the 4-view config file. This resulted in `evaluate_3d()` in `GenerativeEvalHook3D` being performed after every iteration, which is very slow. I noticed that in the 1-view config, the interval is set to 20k. So I would like to know: in the 4-view config, is it necessary to leave this parameter at 1?
The 4-view cfg and 1-view cfg are equivalent apart from testing configuration. You can either train everything using the 1-view cfg, or set a desired interval in the 4-view cfg for training.
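For reference, overriding the evaluation interval in an mmcv-style Python config might look like the sketch below. The base config filename and the exact key layout (`evaluation` as a list of hook dicts with an `interval` field) are assumptions based on common mmcv/mmgen conventions, not verified against the actual SSDNeRF configs.

```python
# Hypothetical training override of the 4-view config (filename is
# an assumption; adjust to the actual config path in the repo).
_base_ = ['./ssdnerf_cars_uncond_4v.py']

# Run GenerativeEvalHook3D every 20k iterations instead of every
# iteration, mirroring the 1-view config's setting for training.
evaluation = [
    dict(
        type='GenerativeEvalHook3D',
        interval=20000,  # was effectively 1 when left unset
    ),
]
```

Alternatively, as noted above, simply training with the 1-view config avoids touching the evaluation hook at all.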
Hi, I would like to know whether it's possible to apply your work to indoor scene reconstruction. I plan to train the code on the ScanNet dataset, which provides images of indoor scenes. After training on many scenes from the dataset, at test time I want to obtain the 3D mesh of an unseen room (not from the dataset) from 4 images taken from its 4 corners. I mainly have two concerns.