Harry-Zhi / semantic_nerf

The implementation of "In-Place Scene Labelling and Understanding with Implicit Scene Representation" [ICCV 2021].

Processing a dataset with a different camera model #32

Closed · wave-transmitter closed this 1 year ago

wave-transmitter commented 1 year ago

Hey,

once again, congrats on the amazing work.

I am trying to train Semantic-NeRF on a custom dataset. I have pre-processed the dataset to match the format of the Replica dataset, and the classes are also similar, so I have managed to train the model without modifying the data loader (replica_datasets.py). Yet the model cannot learn the 3D representation due to a camera pose incompatibility.

Specifically, as far as I can understand from trainer.py and the set_params_replica function, the RGB images are assumed to have been captured with a pin-hole camera. In my case the camera model is different, so I want to modify the function to process the camera poses in the traj_w_c.txt file correctly. Is it enough to change the fx, fy, etc. parameters? What if there is a more complex camera model with additional parameters, such as p?

Any tips or ideas on how to implement a camera model other than pin-hole are more than welcome!

Harry-Zhi commented 1 year ago

Hi @wave-transmitter,

Thanks for your interest. Could you give more information about the camera model you are trying to use? If you are still using a typical pin-hole-like camera, changing fx, fy, cx and cy according to your setup should be fine.
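As a rough illustration (not code from this repo; the method name and numbers below are placeholders in the style of set_params_replica, to be replaced with your own calibration), hard-coding calibrated pin-hole intrinsics could look like:

```python
# Hypothetical sketch: set calibrated pin-hole intrinsics directly, instead of
# deriving them from a field of view as set_params_replica does for Replica.
def set_params_custom(self):
    self.H, self.W = 480, 640       # resolution your camera captures at
    self.fx = 525.0                 # focal lengths in pixels, taken from
    self.fy = 525.0                 # your own calibration
    self.cx = (self.W - 1.0) / 2.0  # principal point; use the calibrated
    self.cy = (self.H - 1.0) / 2.0  # values if it is off-centre
```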

If you are using a different model such as an omnidirectional, fisheye or event camera, you could refer to recent papers like: https://cyhsu14.github.io/OmniNeRF/, https://arxiv.org/abs/2206.11896 and https://4dqv.mpi-inf.mpg.de/EventNeRF/.

wave-transmitter commented 1 year ago

Hi again,

To be honest, I am somewhat confused about the relationship between the camera model that actually captured the training data and the camera model implemented in your code. How are these two correlated? Assuming that we have a set of images captured with a real-life camera and the corresponding camera poses, should this camera be implemented in your code, similarly to the set_params_replica function, or is the pin-hole camera assumption enough to train Semantic-NeRF?

Let me share some extra thoughts.

  1. In the case of the Replica dataset, you mention that the images are captured via Habitat-Sim, where a pin-hole camera is implemented to acquire 640 × 480 images. Based on this resolution and camera model, the traj_w_c.txt file is also produced by Habitat-Sim. However, before training Semantic-NeRF the acquired images are rescaled to 320 × 240, and the fx, fy parameters of the assumed pin-hole camera are computed for this resolution. Doesn't this affect the accuracy of the camera poses, since they were computed for a camera model at 640 × 480 resolution?

  2. Assuming that we have a custom dataset acquired with a real-life camera, and that we extract the corresponding camera poses and parameters fx, fy, etc. with a tool like nerfstudio: if we follow the original pipeline and the input images are rescaled to 320 × 240, what should we do with the remaining parameters, like fx, cx, etc.? Should they be changed according to the initial estimates, or should we stick to the calculations in the code, i.e. Line62 - Line68 in trainer.py? What if there are additional intrinsic parameters, such as the distortion parameters k and p?

Harry-Zhi commented 1 year ago

Hi @wave-transmitter, the camera intrinsics of the data during capture and during NeRF training should in principle be the same, which is what happens in our code base for Replica and ScanNet. The pin-hole camera intrinsics used in our code for either Replica or ScanNet are kept consistent throughout their respective training processes.

If you capture new data via a new camera device or from another source of dataset, then you need to modify/adjust the camera intrinsics.

In all of the discussion above I only talk about $f_x$, $f_y$, $c_x$ and $c_y$, without distortion parameters, as I assume the images have already been undistorted in advance.

For your questions:

  1. Scaling the images has the same effect as scaling the camera intrinsics, and in principle does not affect the camera poses/extrinsics.
  2. I think when you scale the images, you should scale $f_x$, $f_y$, $c_x$ and $c_y$ correspondingly. If you only crop the images, then only $c_x$ and $c_y$ need to change. As for the distortion parameters, I do not have a final answer, but I would first undistort the images before any further scaling/cropping, so that $k$ and $p$ no longer need to be taken care of; see the sketch below.
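To make this concrete, here is a minimal sketch of the pre-processing I have in mind, using OpenCV (the calibration numbers and file name below are placeholders, not values from this code base): undistort once with the calibrated k/p coefficients, then scale the intrinsics together with the images.

```python
import cv2
import numpy as np

# Placeholder pin-hole intrinsics at the capture resolution (640 x 480).
K = np.array([[600.0,   0.0, 319.5],
              [  0.0, 600.0, 239.5],
              [  0.0,   0.0,   1.0]])
dist = np.array([0.1, -0.05, 0.001, 0.001, 0.0])  # k1, k2, p1, p2, k3

img = cv2.imread("frame_000000.png")  # one of your captured frames

# 1) Undistort once, so k and p can be ignored from here on.
img = cv2.undistort(img, K, dist)

# 2) Rescale the image and the intrinsics together, e.g. 640x480 -> 320x240.
scale = 0.5
img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
K[:2] *= scale  # fx, fy, cx and cy all scale; poses/extrinsics do not change
```

The scaled fx, fy, cx and cy are then the values the data loader should use.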

If you are interested in jointly learning calibration and the NeRF model, then SC-NeRF may be what you need.

wave-transmitter commented 1 year ago

Hi @Harry-Zhi,

thanks for the clear explanation and the additional material!