ardaduz / deep-video-mvs

Code for "DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion" (CVPR 2021)
MIT License

Testing Custom video sequences #11

Closed. aakash26 closed this issue 3 years ago.

aakash26 commented 3 years ago

Hi authors, thanks for providing the code and all the information. The online testing script works great on the provided sample HoloLens dataset and on the TUM RGB-D SLAM dataset, and it gives very good results. Now I want to run it on custom videos taken from a smartphone, and I have one question: I am using ORB-SLAM to estimate the camera poses, but it takes around 35-45 minutes of runtime on the GPU. Can you advise a faster algorithm for computing the camera poses?

Thanks, and hoping for a reply,
Aakash Rajpal

ardaduz commented 3 years ago

Hi, surely you don't need groundtruth depth maps to run our algorithm, that would be against its whole purpose :); it was just my implementation that did not allow it. I have now added an option to run the online testing of fusionnet without groundtruth depths for evaluation. You can pull the changes and use evaluate = False here.

One crucial requirement for the videos is metric pose measurements. There must be no scale ambiguity; otherwise, the depth planes used for the plane-sweep stereo won't match the training behaviour and the system will most likely produce inaccurate results (the sketch right after the list below illustrates why). Two options quickly come to mind.

  1. As we discuss in the introduction of our paper, you should record your data with pose information coming from monocular visual-inertial odometry (VIO), which greatly alleviates the scale ambiguity and is widely available on mobile platforms. Since you mention a smartphone, I suggest writing a simple app for Android or iPhone with ARCore or ARKit, respectively. Both libraries provide metric pose information by running VIO under the hood, check [here](https://developers.google.com/ar/reference/java/com/google/ar/core/Camera#getPose()). Writing an Android app to record the necessary data is among my TODOs, but it is not a priority, so I can't specify a time.
  2. If possible, you can replace your single smartphone with a stereo setup with a known baseline distance (any prior pose/scale information is fine) and adapt ORB-SLAM2 for online stereo SLAM or COLMAP for offline SfM; a small COLMAP pose-conversion sketch is included further below. A side note about ORB-SLAM: the runtimes should not be that high as far as I know; after all, it is presented as a real-time SLAM library. Also, since it seems you don't need online pose estimation, I would suggest COLMAP to get more accurate poses, and I personally find the COLMAP implementation easy to adapt to my needs, for example adding stereo-rig optimization or pose priors.
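
To make the scale requirement above concrete, here is a rough illustration of why metric translations matter for the plane sweep. This is only a sketch of mine with example values, not code from this repository: the depth hypotheses are fixed metric distances, so any scale factor on the translation silently shifts the whole depth range.

```python
import numpy as np

def plane_sweep_homographies(K_ref, K_src, T_ref_to_src, depths):
    """Homographies mapping reference pixels to source pixels for points lying on
    fronto-parallel planes Z = depth (metres) in the reference frame.
    Convention: x_src = R @ x_ref + t, where the translation t is metric."""
    R = T_ref_to_src[:3, :3]
    t = T_ref_to_src[:3, 3:4]
    n = np.array([[0.0, 0.0, 1.0]])      # plane normal in the reference frame
    K_ref_inv = np.linalg.inv(K_ref)
    return [K_src @ (R + (t @ n) / d) @ K_ref_inv for d in depths]

# Illustrative inverse-depth-spaced hypotheses (example values, not necessarily ours).
depths = 1.0 / np.linspace(1.0 / 20.0, 1.0 / 0.25, num=64)

# If the estimated translation carries an unknown scale factor s (t -> s * t), the warp
# computed for hypothesis d actually corresponds to a plane at depth d / s, so the cost
# volume is built over a depth range the network never saw during training.
```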

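If you go the COLMAP route, something like the following rough, untested sketch (assuming the text-format images.txt output) converts COLMAP's world-to-camera quaternion and translation entries into 4x4 camera-to-world matrices. Please check the sample HoloLens data for the exact pose convention our testing scripts expect, and remember that a monocular reconstruction is only determined up to scale, so you still need the known stereo baseline (or another prior) to bring it to metres.

```python
import numpy as np

def quat_to_rotmat(qw, qx, qy, qz):
    """Rotation matrix from a unit quaternion (COLMAP stores qw first)."""
    q = np.array([qw, qx, qy, qz], dtype=np.float64)
    q /= np.linalg.norm(q)
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def read_colmap_poses(images_txt_path):
    """Parse COLMAP's images.txt into {image_name: 4x4 camera-to-world pose}.
    COLMAP stores the world-to-camera rotation and translation, so both are inverted here.
    Assumes the usual layout: two lines per registered image (pose line, then 2D-point line)."""
    with open(images_txt_path) as f:
        lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]
    poses = {}
    for header in lines[0::2]:
        elems = header.split()
        qw, qx, qy, qz, tx, ty, tz = map(float, elems[1:8])
        name = elems[9]
        R = quat_to_rotmat(qw, qx, qy, qz)
        t = np.array([tx, ty, tz])
        pose = np.eye(4)
        pose[:3, :3] = R.T           # camera-to-world rotation
        pose[:3, 3] = -R.T @ t       # camera centre in world coordinates
        poses[name] = pose
    return poses
```
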
Hope these ideas somehow help you.

aakash26 commented 3 years ago

Hi @ardaduz ,

Thanks for replying so quickly. Yes, I understood that the groundtruth depth maps are only needed for evaluation, but I will pull the latest code. For the camera poses, I was using ORB-SLAM for online monocular estimation, but I agree with you. I have a stereo setup, will hopefully use it today, and will first try estimating the poses with COLMAP. Thanks for the input, I will let you know the results :)

ardaduz commented 3 years ago

I am closing the issue for now; please feel free to reopen it if you want to discuss further.