facebookresearch / NeRF-Det

[ICCV 2023] Code for NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection
https://chenfengxu714.github.io/nerfdet/

Training time cost #10

Open OrangeSodahub opened 1 year ago

OrangeSodahub commented 1 year ago

I was wondering about the time cost of training under your specific configurations. Is the NeRF branch time-consuming?

chenfengxu714 commented 1 year ago

Training is very efficient; it takes only around 7-10 hours on 4 V100 GPUs.

OrangeSodahub commented 1 year ago

@chenfengxu714 That's amazing. Let me double-check: for training, you trained on the full ScanNet train set (~1000 scenes), one scene per iteration, and each scene uses 20 view images as input (in both the NeRF branch and the detection branch), costing 16 GB of memory.

I also wonder whether, at inference time, it can take in enough images (even all of them, memory permitting) to represent the entire scene well in the NeRF branch.

chenfengxu714 commented 1 year ago

Yes, you can. But I suggest using our advanced config, which is inspired by our previous work SOLOFusion, i.e., using more frames but lower-resolution images. This is more effective and efficient. If your memory is sufficient and you do not care about latency, just use as many frames and as high a resolution as possible.
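To make the frames-vs-resolution trade-off concrete, here is a rough sketch. The field names and numbers below are illustrative only, not the repository's actual config keys or values:

```python
# Hypothetical mmdet-style config fragments illustrating the trade-off
# described above: many more frames at a lower image resolution can cost
# less than fewer frames at full resolution. Names/values are made up.
high_res_cfg = dict(
    num_frames=50,          # fewer views...
    img_scale=(480, 640),   # ...at higher resolution
)
solofusion_style_cfg = dict(
    num_frames=100,         # twice the views...
    img_scale=(240, 320),   # ...at half the resolution per side
)

def pixel_budget(cfg):
    """Total pixels processed per scene: a crude proxy for memory/latency."""
    h, w = cfg['img_scale']
    return cfg['num_frames'] * h * w
```

Under this crude proxy, the low-resolution/many-frame config processes half as many pixels while seeing twice as many viewpoints.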

OrangeSodahub commented 1 year ago

@chenfengxu714 I saw that your configs just randomly choose 50 images per scene and randomly pick 10 of them as target views (meaning the other 40 are source views). In 'volume' mode you build the scene volume from the 40 source views and render the 10 target views. And you calculate, along each ray from a target view, how many sample points are seen by the source views, in order to mask out part of them. I wonder: is it reasonable to render a ray even when not all points along it are seen?

And at test time, how can you make sure the 10 target views are fully covered by the 40 source views?
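The kind of visibility check described above can be sketched as follows. This is a minimal illustration of projecting ray samples into the source cameras and keeping points seen by at least one view, not the repository's actual implementation:

```python
import numpy as np

def visibility_mask(points, intrinsics, extrinsics, img_hw):
    """Count, for each 3D sample point along the target-view rays, how many
    source views project it in front of the camera and inside the image,
    then keep points observed by at least one source view.

    points:     (N, 3) world-space samples along target-view rays
    intrinsics: (V, 3, 3) source-camera intrinsics
    extrinsics: (V, 4, 4) world-to-camera transforms
    img_hw:     (H, W) source image size
    """
    H, W = img_hw
    n = points.shape[0]
    homog = np.concatenate([points, np.ones((n, 1))], axis=1)  # (N, 4)
    seen = np.zeros(n, dtype=int)
    for K, T in zip(intrinsics, extrinsics):
        cam = (T @ homog.T).T[:, :3]          # points in the camera frame
        in_front = cam[:, 2] > 1e-6           # positive depth only
        uv = (K @ cam.T).T
        uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)  # perspective divide
        in_img = (uv[:, 0] >= 0) & (uv[:, 0] < W) & \
                 (uv[:, 1] >= 0) & (uv[:, 1] < H)
        seen += (in_front & in_img).astype(int)
    return seen >= 1
```

Rays whose samples are all masked out carry no multi-view evidence, which is exactly the sparse-coverage case raised in the question.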

chenfengxu714 commented 1 year ago

Good question. We do not deal with this issue, and it is indeed a difficult one. We did find that when the views are sparse, the NVS is much worse, i.e., in many cases the target points do not project into any source view. The model then reduces to vanilla NeRF, which can be expressed as "density, h = mlp(pos_enc) -> rgb = mlp(h, view_dir)". This is also why our SOLOFusion trick helps, since we reduce the resolution but greatly increase the number of views.
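The vanilla-NeRF fallback quoted above ("density, h = mlp(pos_enc) -> rgb = mlp(h, view_dir)") can be sketched in a few lines. The layer sizes and random weights below are purely illustrative; a real NeRF learns these parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative weights (a trained NeRF learns these); sizes are made up:
# 63-dim positional encoding, 27-dim view-direction encoding, 128 hidden.
W_trunk = rng.normal(size=(63, 128)) * 0.1        # pos_enc -> hidden
W_sigma = rng.normal(size=(128, 1)) * 0.1         # hidden  -> density
W_rgb = rng.normal(size=(128 + 27, 3)) * 0.1      # [hidden, dir] -> rgb

def vanilla_nerf(pos_enc, view_dir):
    """density, h = mlp(pos_enc); rgb = mlp(h, view_dir)."""
    h = relu(pos_enc @ W_trunk)
    density = relu(h @ W_sigma)                    # view-independent density
    rgb = sigmoid(np.concatenate([h, view_dir], axis=-1) @ W_rgb)
    return density, rgb
```

The key structural point is that density depends on position alone, while color additionally conditions on the viewing direction; with no source-view features to inject, this is all the information the branch has left.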

OrangeSodahub commented 1 year ago

@chenfengxu714 I wonder if you could share the mmdet training logs, e.g., the loss values.

chenfengxu714 commented 11 months ago

Sorry, I did want to, but I can't currently access the workstation I used. I will re-clean the code and polish everything, code and experiments, on my own machines after CVPR.