facebookresearch / TemporallyConsistentDepth

Code for our CVPR 2023 paper on online, temporally consistent depth estimation.

Scaled Depth -- using Ground Truth Depth to Correct Mono Depth at Inference #5

Closed TouqeerAhmad closed 6 months ago

TouqeerAhmad commented 1 year ago

Hello,

I was going through the code and noticed that the data loaders for both COLMAP and ScanNet (referenced below) load the ground-truth depth map during inference to provide the scale, which is then used to correct the monocular depth predicted by DPT. I understand that temporal/spatial fusion requires scaled depth, but using the ground-truth depth itself to correct the inferred depth seems like cheating, and this is in addition to using GT camera poses and intrinsics.

https://github.com/facebookresearch/TemporallyConsistentDepth/blob/be85390cf5db72a996bebba3d9f34439f1576196/datasets/scannet.py#L54
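For readers following along, the kind of scale correction under discussion can be sketched as below. This is a minimal illustration, not the repository's actual code: it assumes the monocular prediction has already been converted to depth that is correct up to one unknown scale factor, that missing reference pixels are stored as 0, and the function names `median_scale` and `apply_scale` are hypothetical.

```python
from statistics import median


def median_scale(pred, ref, min_depth=1e-3):
    """Return s so that s * pred matches ref in the median-ratio sense.

    pred: flattened relative depth values (unknown global scale).
    ref:  flattened reference depth in metric units; values at or below
          min_depth are treated as missing (assumption for this sketch).
    """
    ratios = [r / p for p, r in zip(pred, ref) if p > min_depth and r > min_depth]
    if not ratios:
        raise ValueError("no valid reference depth samples")
    return median(ratios)


def apply_scale(pred, scale):
    """Scale every predicted depth value by the estimated factor."""
    return [scale * p for p in pred]


# Toy example: the prediction is half the metric depth everywhere,
# and one reference pixel is missing (0.0).
pred = [0.5, 1.0, 2.0, 4.0]
ref = [1.0, 2.0, 0.0, 8.0]

scale = median_scale(pred, ref)
scaled = apply_scale(pred, scale)
```

The point of the sketch is that the correction needs metric reference values from somewhere. Whether they come from a ground-truth depth map or from another source is exactly what the rest of this thread is about.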

Even if one employs SLAM/ICP, as suggested in the paper's conclusion, to get the camera R|T and has K available as well, one still needs scaled depth, which is not available from DPT or other monocular networks and so has to come from GT. This limits the applicability of the method to situations where correct metric depth is already available, so it is not deployable in the real world.

I was wondering if you have any comments regarding this.

Thanks!

nkhan2 commented 11 months ago

Hi TouqeerAhmad, for real-world situations, sparse depth points for scaling can be obtained using SLAM or dense/semi-dense tracking. With greater control over the hardware, one could also use a ToF or LiDAR sensor. In fact, the ScanNet dataset only has "ground truth" depth and camera poses to the extent that they were captured with a depth sensor and pre-calibrated.
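As a rough sketch of scaling from sparse points only, the snippet below fits a scale and shift by ordinary least squares using a handful of sparse metric samples. Everything here is an assumption for illustration: the helper name `fit_scale_shift` is hypothetical, the sparse samples are given as a mapping from flattened pixel index to metric depth, and the fit is done directly in depth space (a real pipeline might fit in inverse-depth space and add outlier rejection).

```python
def fit_scale_shift(pred, sparse):
    """Least-squares fit of a, b such that a * pred[i] + b ~= sparse[i].

    pred:   flattened relative depth values.
    sparse: dict {pixel_index: metric_depth}, e.g. from tracked landmarks
            or a sparse depth sensor (assumed input format).
    """
    idx = sorted(sparse)
    if len(idx) < 2:
        raise ValueError("need at least two sparse samples")
    x = [pred[i] for i in idx]
    y = [sparse[i] for i in idx]
    n = len(idx)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        raise ValueError("predicted depth is constant at the sampled pixels")
    a = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / var_x
    b = mean_y - a * mean_x
    return a, b


# Toy example: metric depth = 3 * pred + 0.5, observed at three pixels only.
pred = [0.2, 0.4, 0.6, 0.8, 1.0]
sparse = {0: 1.1, 2: 2.3, 4: 3.5}

a, b = fit_scale_shift(pred, sparse)
metric = [a * p + b for p in pred]
```

Plain least squares is sensitive to erroneous samples, so noisy sparse points would call for a robust variant.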

Hope this helps.

TouqeerAhmad commented 11 months ago

LiDAR samples are generally not accurate; depth completion usually focuses on generating dense depth while correcting the erroneous sparse samples. That becomes a different problem, i.e., solving depth completion and then using TCOD for temporal consistency. Here, the assumption is that the DPT/monocular depth is already dense, so relying on ground-truth samples to correct it seems odd. SLAM/SfM would also require fairly good motion and have issues of their own, again limiting the applicability and/or invalidating the scale. Thank you for your input on this!