ardaduz / deep-video-mvs

Code for "DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion" (CVPR 2021)
MIT License

Question for frame selection #7

Closed LifeBeyondExpectations closed 3 years ago

LifeBeyondExpectations commented 3 years ago

Thank you for sharing this precious work. I have a question about the frame selection.

In the code below, https://github.com/ardaduz/deep-video-mvs/blob/043f25703e5135661a62c9d85f994ecd4ebf1dd0/dvmvs/config.py#L12

You set the hyper-parameters as

train_minimum_pose_distance = 0.1250
train_maximum_pose_distance = 0.3250

I found that during training there is sometimes no overlapping region between frames, mainly because of the large motion between views.

Is there a reason you chose large-motion views? This seems different from a typical video setting, where adjacent frames have relatively small motion.

ardaduz commented 3 years ago

I am not sure if I understand your question correctly. These hyper-parameters control the sampling rate of the frames to be placed in a training subsequence, i.e. keyframes. We empirically found that these two values (plus slightly relaxed and stricter versions, obtained with a multiplier, for data augmentation and increased diversity, see https://github.com/ardaduz/deep-video-mvs/blob/master/dvmvs/dataset_loader.py#L151) give a good balance between the amount of overlap and the viewpoint change/baseline between two consecutive frames.
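For illustration only, a relaxed/strict variant of the distance range could be derived like this. The multiplier value and the exact scheme below are assumptions for the sketch, not the repository's code; see the linked dataset_loader.py for the actual augmentation:

```python
# Hypothetical sketch: widening/narrowing the empirical pose-distance
# range with a multiplier for data augmentation. The multiplier value
# is an assumption, not taken from the repository.
train_minimum_pose_distance = 0.1250
train_maximum_pose_distance = 0.3250
multiplier = 1.2  # hypothetical augmentation factor

# relaxed: accept slightly smaller and larger baselines
relaxed_range = (train_minimum_pose_distance / multiplier,
                 train_maximum_pose_distance * multiplier)
# strict: demand baselines well inside the empirical range
strict_range = (train_minimum_pose_distance * multiplier,
                train_maximum_pose_distance / multiplier)
```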

An example from the first training sequence, ScanNet scene0000_00:

Reference Image (t): [image]

Measurement Image (t-1): [image]

Reference Pose (homogeneous coordinates):

    [[-0.9811141   0.05524587 -0.18537323  3.0066042 ]
     [ 0.19318187  0.3284281  -0.9245624   3.2812254 ]
     [ 0.00980358 -0.9429118  -0.33289796  1.451942  ]
     [ 0.          0.          0.          1.        ]]

Measurement Pose (homogeneous coordinates):

    [[-0.99106467  0.05330807  0.12226668  2.873412  ]
     [-0.0878472   0.42890215 -0.8990695   3.387271  ]
     [-0.10036808 -0.90177673 -0.42038673  1.4275676 ]
     [ 0.          0.          0.          1.        ]]

Pose Distance Between These Two Frames = pose_distance(reference_pose, measurement_pose) = 0.313

As you can see, this pose distance is toward the higher end of our empirical range for this particular consecutive pair of frames, and there is still a considerable amount of overlap.
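For completeness, the 0.313 value can be reproduced from the two poses above. The sketch below mirrors the combined translation/rotation measure used by the project; treat the function body as a reimplementation for illustration, not code copied from utils.py:

```python
import numpy as np

def pose_distance(reference_pose, measurement_pose):
    """Combined translation/rotation distance between two 4x4
    camera-to-world poses: sqrt(||t||^2 + (2/3) * tr(I - R)),
    computed on the relative pose (reimplementation sketch)."""
    rel = np.linalg.inv(reference_pose) @ measurement_pose
    R, t = rel[:3, :3], rel[:3, 3]
    return float(np.sqrt(np.dot(t, t) + (2.0 / 3.0) * np.trace(np.eye(3) - R)))
```

Plugging in the reference and measurement poses above yields approximately 0.313, matching the number quoted here.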

Are you using your own data (is it not a video)? If you're using custom video data, can you verify that poses.txt provides camera-to-world poses (not camera extrinsic matrices)? See https://github.com/ardaduz/deep-video-mvs/blob/master/dvmvs/utils.py#L17.
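As a quick sanity check (a minimal sketch, not the repository's code; both function names are my own): a world-to-camera extrinsic converts to a camera-to-world pose by a single matrix inverse, and for a camera-to-world pose the top three entries of the last column are the camera center in world coordinates:

```python
import numpy as np

def extrinsic_to_camera_to_world(extrinsic):
    """Convert a 4x4 world-to-camera extrinsic [R|t; 0 1] into a
    camera-to-world pose (hypothetical helper for illustration)."""
    return np.linalg.inv(extrinsic)

def camera_center(camera_to_world):
    """For a camera-to-world pose, the translation column is the
    camera center in world coordinates."""
    return camera_to_world[:3, 3]
```

If the centers returned for your trajectory do not trace out a plausible camera path, the file likely contains extrinsics rather than poses.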

If this does not address your concern, it would help if you could explain the issue with an example similar to mine here.

LifeBeyondExpectations commented 3 years ago

Thank you for your thorough comment. I also use the same type of poses (camera-to-world) from the ScanNet dataset. Let me try more examples on my own and then close the issue.

Meanwhile, do you have your own ablation study about the relation between depth accuracy and frame selection?

ardaduz commented 3 years ago

In our paper, the frame selection approach is explained in Section 4.2. There is an ablation study (Table 3) on the test-time frame selection strategy, where several past keyframes are buffered, and measurement frames with roughly a 15 cm baseline and low relative rotation are preferred for a given reference frame. We compare this approach to the naive sampling of every 10th or 20th frame, which is commonly used in the literature. There is no ablation study on the training-time selection of frames.
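Purely as an illustration, a penalty for ranking buffered keyframes could combine the deviation from the preferred 15 cm baseline with the relative rotation. The scoring function below and its weighting are assumptions for the sketch, not the paper's exact formula:

```python
def keyframe_penalty(rel_translation_m, rel_rotation_rad,
                     optimal_baseline_m=0.15, rotation_weight=1.0):
    """Hypothetical ranking score, lower is better: penalize deviation
    from the preferred baseline and any relative rotation. The weighting
    is an illustrative assumption, not the paper's formulation."""
    return (abs(rel_translation_m - optimal_baseline_m)
            + rotation_weight * rel_rotation_rad)
```

Under such a score, a buffered frame with a ~15 cm baseline and no relative rotation would rank best.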

NoOneUST commented 2 years ago

Hello, can you provide the frame ID sequences on ScanNet and 7-Scenes determined by your frame selection strategy? In other words, which frames are actually selected? Or could you tell me which function I can call to create such an ID sequence? Thank you so much!

ardaduz commented 2 years ago

Hi, you can use https://github.com/ardaduz/deep-video-mvs/blob/master/dvmvs/simulate_keyframe_buffer.py. Please set the input and output folders, set the number of measurement frames, and remove the "simple selection" outputs if you don't need them.

NoOneUST commented 2 years ago

> Hi, you can use https://github.com/ardaduz/deep-video-mvs/blob/master/dvmvs/simulate_keyframe_buffer.py. Please set the input folder and output folders, and set the number of measurement frames, and remove the "simple selection" ones if you don't need them.

I notice that the paper mentions a difference between training and testing. If I directly run simulate_keyframe_buffer.py, is the output suitable for training or for testing? How can I obtain both? Also, what is the meaning of n_measurement_frames, and how should I set it? I just want a reference image sequence in which each image is processed one by one, with its neighbors treated as the source views.

ardaduz commented 2 years ago

simulate_keyframe_buffer.py produces a sequence like this one: https://github.com/ardaduz/deep-video-mvs/blob/master/sample-data/indices/keyframe%2Bhololens-dataset%2B000%2Bnmeas%2B3. The output is suitable for testing an online depth prediction system (where we don't know what camera motion will happen in the future). As the README also explains, "In a keyframe file, each row represents a timestep, the entry in the first column represents the reference frame, and the entries in the second, third, ... columns represent the measurement frames used for the cost volume computation [for that timestep's reference frame]."

In a nutshell, the keyframe buffer saves pose-wise suitable, "nice" past frames and serves them as measurement frames. A reference frame uses n_measurement_frames of these past frames to compute the cost volume for that timestep. You can observe the effect of different n_measurement_frames settings by comparing these files: https://github.com/ardaduz/deep-video-mvs/tree/master/sample-data/indices.
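A small helper along these lines can parse such a keyframe file (assuming whitespace-separated frame identifiers per row, as in the sample files; the function name is my own, not from the repository):

```python
def read_keyframe_file(path):
    """Parse a keyframe index file. Each non-empty row yields a tuple
    (reference_frame, [measurement_frames...]); the first token is the
    reference frame, the remaining tokens are its measurement frames."""
    timesteps = []
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if tokens:
                timesteps.append((tokens[0], tokens[1:]))
    return timesteps
```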

For training, we sample short subsequences (8 timesteps for this work) out of the training set, where we enforce a minimum and maximum viewpoint change between every pair of consecutive frames. At each timestep, the reference frame uses the immediately preceding frame as the measurement frame. This subsequence sampling is done in https://github.com/ardaduz/deep-video-mvs/blob/master/dvmvs/dataset_loader.py, before training starts. This is closer to what you want, as far as I understood your questions. However, I do not have a script that produces a complete sequence like this for the test scenes. Please feel free to write one and open a PR if you want.
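The training-time sampling described above can be sketched roughly as follows. This is a greedy simplification under the stated min/max pose-distance constraint; the actual logic in dataset_loader.py is more involved, and the pose_distance body here is a reimplementation for illustration:

```python
import numpy as np

def pose_distance(p1, p2):
    """Combined translation/rotation distance between two 4x4
    camera-to-world poses (reimplementation sketch)."""
    rel = np.linalg.inv(p1) @ p2
    R, t = rel[:3, :3], rel[:3, 3]
    return float(np.sqrt(np.dot(t, t) + (2.0 / 3.0) * np.trace(np.eye(3) - R)))

def sample_subsequence(poses, length=8, min_dist=0.125, max_dist=0.325, start=0):
    """Greedily pick `length` frame indices so that each consecutive
    pair's pose distance lies within [min_dist, max_dist].
    Returns the index list, or None if no valid subsequence exists."""
    indices = [start]
    last = start
    for i in range(start + 1, len(poses)):
        d = pose_distance(poses[last], poses[i])
        if d > max_dist:
            return None  # gap too large: no valid next keyframe from `last`
        if d >= min_dist:
            indices.append(i)
            last = i
            if len(indices) == length:
                return indices
    return None
```

For a camera translating 5 cm per frame with no rotation, this picks every third frame, since 15 cm is the first step that clears the 12.5 cm minimum.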