Hi. All of the methods take different approaches to frame selection or frame sampling. During testing, our goal is to simulate an online multi-view capture that adds keyframes frequently and predicts depth maps frequently as long as there is camera motion; without camera motion, MVS obviously does not make sense. Since keyframes are added frequently, I think the comparison should be fair unless there is a critical bug. I am sharing the frames selected for the results in Table 1: frame-selection-indices.zip. The keyframe selection is blind to the image contents; it is produced by the simulate_keyframe_buffer.py script, i.e., by looking only at camera motion. Such selection clearly improves the results compared to naive sampling, cf. Table 3.
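For reference, here is a minimal sketch of how such a motion-only keyframe buffer can be simulated. The thresholds, function names, and camera-to-world pose convention are illustrative assumptions on my part, not the actual contents of simulate_keyframe_buffer.py:

```python
import numpy as np

# Illustrative thresholds only; the actual values used by
# simulate_keyframe_buffer.py may differ.
T_TRANS_M = 0.10   # translation threshold in meters
T_ROT_DEG = 5.0    # rotation threshold in degrees

def pose_distance(pose_a, pose_b):
    """Translation (m) and rotation (deg) between two 4x4 camera-to-world poses."""
    rel = np.linalg.inv(pose_a) @ pose_b
    trans = np.linalg.norm(rel[:3, 3])
    # Rotation angle recovered from the trace of the relative rotation matrix.
    cos_angle = np.clip((np.trace(rel[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot = np.degrees(np.arccos(cos_angle))
    return trans, rot

def simulate_keyframes(poses):
    """Indices of frames that a motion-only buffer would accept as keyframes."""
    keyframes = [0]  # the first frame always enters the buffer
    for i in range(1, len(poses)):
        trans, rot = pose_distance(poses[keyframes[-1]], poses[i])
        if trans > T_TRANS_M or rot > T_ROT_DEG:
            keyframes.append(i)
    return keyframes
```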
This is a snippet from the first scene in the ScanNet test set, scene0707_00:

```
000020.png 000000.png
000028.png 000000.png
000033.png 000020.png
000038.png 000028.png
000044.png 000033.png
000049.png 000038.png
000054.png 000044.png
.........
```
Time flows vertically. The first column contains the reference frames for which the systems predict a depth map. The second column contains the corresponding single measurement frame used to compute a cost volume or triangulation for ALL METHODS except NeuralRGBD. To stay as close as possible to the original NeuralRGBD inference scheme, I completely ignore the second column for NeuralRGBD and, using only the first column, create a special sequence for it. By default, and as the trained weights require, NeuralRGBD needs 4 measurement frames. So, for frame 000033, it takes frames 000020, 000028, 000038, and 000044 as measurement frames. For 000038, it takes 000028, 000033, 000044, and 000049. And so on...
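A minimal sketch of how that special sequence is built from the first column, assuming the rule is simply "two nearest keyframes before, two after" (the function name is hypothetical):

```python
def neuralrgbd_measurement_frames(ref_frames, idx, n_meas=4):
    """Pick the n_meas//2 keyframes before and after position idx
    in the reference-frame list as measurement frames."""
    half = n_meas // 2
    before = ref_frames[max(0, idx - half):idx]
    after = ref_frames[idx + 1:idx + 1 + half]
    return before + after

ref_frames = ["000020", "000028", "000033", "000038", "000044", "000049", "000054"]
print(neuralrgbd_measurement_frames(ref_frames, ref_frames.index("000033")))
# -> ['000020', '000028', '000038', '000044'], matching the example above
```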
As far as I can see, you also have a very nice recent work on MVS on video, and you evaluate NeuralRGBD as well. A major difference between how we and how you report performance is the separate 0~5 m range, which is an extremely limited range, especially considering that NeuralRGBD is trained on ScanNet and the dataset provides depth values up to 10 meters anyway. We do not differentiate by a method's maximum range, which can clearly drop its performance. Reporting results in the separated fashion you use is something we might consider in a future revision.
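To make the difference concrete, here is a minimal sketch of a range-capped threshold metric. The function name and the choice of masking on ground-truth depth are my assumptions about a typical evaluation, not your exact protocol:

```python
import numpy as np

def delta_125(pred, gt, max_depth=None):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < 1.25.
    max_depth=5.0 reproduces a capped 0-5 m evaluation;
    max_depth=None evaluates over the full ground-truth range."""
    valid = gt > 0
    if max_depth is not None:
        valid &= gt <= max_depth
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < 1.25))
```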
Thanks very much for your detailed explanations!
Your work is great! However, when I look at Table 1, I am surprised that the performance of NeuralRGBD is so low, especially for the key metric (δ < 1.25). NeuralRGBD just takes frames at an interval of 5 as inputs, without any frame selection. Did you perform frame selection before running the inferences in Table 1? And what is your test split over all the testing files?
Thx