ardaduz / deep-video-mvs

Code for "DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion" (CVPR 2021)
MIT License

Why isn't the cell state 'C' in the ConvLSTM cell warped? #10

Closed DingYikang closed 3 years ago

DingYikang commented 3 years ago

Hi, thanks for your nice work. After reading the paper, I'm wondering why the cell state 'C' is not warped to the next viewpoint. I didn't find any experiments or ablation studies on this. Could you please explain? Thanks!

ardaduz commented 3 years ago

Hi, we have not experimented with this setup beyond the ConvLSTM vs. ConvGRU comparison given in the supplementary. I can only speculate by extrapolating from that experiment: warping both states may behave similarly to the ConvGRU (single-state) case, in that completely removing information from non-overlapping regions between every pair of consecutive timesteps can negatively affect longer-term information transport. This is a lot of guessing, though; any input or ideas from your side are also appreciated.

DingYikang commented 3 years ago

Thanks for your explanation! If completely removing cell state information from non-overlapping regions between consecutive timesteps can negatively affect longer-term information transport, can warping the hidden state also negatively affect it? In our tests of DeepVideoMVS, we got bad results when the video frame rate dropped to 3 FPS. That may indicate that warping the hidden state can also negatively affect longer-term information transport. As for warping the cell state, I guess only experiments can tell. Finally, thanks for your great work and your detailed reply!

ardaduz commented 3 years ago

For your last sentence: Yes, the effects of warping the cell state (C) can only be observed through further experiments.

Concerning the previous part: we show that having two states is important, and the KEY idea of our work is that "warping the hidden state (H) by leveraging the projective geometry is meaningful and performs much better than not warping". When a ConvLSTM cell is used, information transport predominantly happens through the cell state (C); see the ConvLSTM cell definition. In DeepVideoMVS, the hidden state (H) brings immediate information from the previous time-step in an empirically more effective way (by being warped), i.e., it transfers the depth encodings of scene contents instead of image contents. So I do not agree with your statement about the negative effects on long-term information transport.
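To make the distinction between the two states concrete, here is a minimal toy sketch. It is not the repository's implementation. It assumes a standard LSTM update applied per pixel (a 1x1 "convolution" with one shared scalar weight `w`, used for all gates purely for brevity) and replaces the projective warp with a hypothetical integer shift, `shift_warp`, that zero-fills pixels with no source. All names are illustrative, and the actual cell in the repository may differ in its activations and normalization.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shift_warp(row, shift):
    """Toy stand-in for the projective warp: shift a 1D feature row to the
    right by `shift` pixels and zero-fill pixels that have no source."""
    n = len(row)
    out = [0.0] * n
    for i in range(n):
        j = i - shift
        if 0 <= j < n:
            out[i] = row[j]
    return out


def lstm_step(x, h_prev, c_prev, w=1.0):
    """Per-pixel standard LSTM update (assumption: all gates share weight w).

    C_t = f * C_{t-1} + i * g
    H_t = o * tanh(C_t)
    """
    h_new, c_new = [], []
    for xi, hi, ci in zip(x, h_prev, c_prev):
        z = w * (xi + hi)
        i_gate = f_gate = o_gate = sigmoid(z)
        g = math.tanh(z)
        c = f_gate * ci + i_gate * g  # C is carried forward through the forget gate
        c_new.append(c)
        h_new.append(o_gate * math.tanh(c))
    return h_new, c_new


h_prev = [0.5, 0.5, 0.5, 0.5]
c_prev = [1.0, 1.0, 1.0, 1.0]
x = [0.0, 0.0, 0.0, 0.0]

# Only H is warped; the first two pixels have no source and become 0.0.
h_warped = shift_warp(h_prev, 2)
h_new, c_new = lstm_step(x, h_warped, c_prev)
```

In this toy setup the first two entries of `h_warped` are zero, while `c_new` at those pixels is still nonzero because the forget gate carries part of `c_prev` forward. Whether warping C as well would help or hurt a trained network remains the open question discussed in this thread.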

If the videos you are testing with are characteristically similar to the ScanNet dataset, I would not expect 3 FPS videos to produce really bad results. They might be somewhat worse than being able to pick heuristically better keyframes from a 30 FPS video, but they should not be too bad. I can support this claim with Table 3 from the paper. In that table, naive sampling of every 10th frame corresponds to ~3 FPS video, and naive sampling of every 20th frame corresponds to ~1.5 FPS video. There is certainly a performance drop for all of the methods, but all of them still perform OK.

If the videos are characteristically very different (let's say the camera operator moves very rapidly), then the visual overlap between consecutive frames can drop quite a bit at 3 FPS. If the visual overlap between timesteps is small, the warp operation fills the non-overlapping regions with zeros, which removes information. With small overlaps, it becomes harder to do even simple multi-view stereo, let alone warping between time-steps.

I would suggest examining the visual change between the frames of your videos and using a keyframe selection algorithm based on camera poses (or a more sophisticated algorithm) to achieve independence from the camera operator and the recording speed.
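As a rough illustration of the two points above (the frame-rate arithmetic and pose-based keyframe selection), here is a minimal sketch. It is not the keyframe logic used in the repository. It assumes 4x4 camera-to-world pose matrices given as nested lists, and the function names and thresholds (`min_translation`, `min_rotation_deg`) are hypothetical values chosen only for the example.

```python
import math


def effective_fps(source_fps, stride):
    """Frame rate obtained by naively keeping every `stride`-th frame."""
    return source_fps / stride


def relative_motion(pose_a, pose_b):
    """Return (distance between camera centers, rotation angle in radians)
    for two 4x4 camera-to-world poses."""
    dist = math.sqrt(sum((pose_b[r][3] - pose_a[r][3]) ** 2 for r in range(3)))
    # trace(R_a^T R_b) equals the sum of element-wise products of the rotations.
    trace = sum(pose_a[r][c] * pose_b[r][c] for r in range(3) for c in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return dist, math.acos(cos_angle)


def select_keyframes(poses, min_translation=0.1, min_rotation_deg=15.0):
    """Keep a frame once the camera has moved or rotated enough relative to
    the last kept keyframe. Thresholds are illustrative assumptions."""
    if not poses:
        return []
    keyframes = [0]
    for idx in range(1, len(poses)):
        dist, angle = relative_motion(poses[keyframes[-1]], poses[idx])
        if dist >= min_translation or math.degrees(angle) >= min_rotation_deg:
            keyframes.append(idx)
    return keyframes


# Every 10th / 20th frame of a 30 FPS recording.
fps_every_10th = effective_fps(30, 10)
fps_every_20th = effective_fps(30, 20)
```

Because the selection depends on camera motion rather than on frame indices, a slowly moving camera yields fewer keyframes and a fast one yields more, regardless of the recording frame rate. Suitable thresholds depend on the scene scale and would need tuning on your own data.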

Hope these ideas help.

DingYikang commented 3 years ago

Thanks for your detailed reply! It helps a lot. For the low-frame-rate test, we use data captured with a smartphone, and we'll run more experiments to see the results. Anyway, this is great work. I appreciate it a lot.

ardaduz commented 3 years ago

I am closing the issue for now. Please feel free to reopen it if you want to discuss further.