How to interpret time-to-contact and triangulation depth estimation result?

gengshan-y / expansion

Upgrading Optical Flow to 3D Scene Flow through Optical Expansion, CVPR 2020 (Oral).

https://gengshan-y.github.io/expansion/

MIT License


How to interpret time-to-contact and triangulation depth estimation result? #14

Closed Yibin-Li closed 3 years ago

Yibin-Li commented 3 years ago

First, thank you so much for your great work!

I read Section 4.5 of the paper and thought the depth estimates from triangulation and time-to-contact were in absolute scale (e.g. in meters). I was able to run the "demo-expansion" notebook and reproduce the three visualizations exactly (disp0, disp_flow, and disp_p3d), but when I print the min/max pixel values for each plot, the numbers seem too small to be in absolute scale.

I know that monodepth2 only produces relative depth. What about the triangulation (disp_flow) and time-to-contact (disp_p3d) depth results? How should I interpret them? Thanks!

gengshan-y commented 3 years ago

Hi, the depth results from triangulation and time-to-contact are up-to-scale. In fact, they are computed assuming the camera translation magnitude is equal to 1. E.g., if you print out the magnitude of the T variable, it will be one.

To convert them to metric depth, you first need to invert disparity to get depth, and then multiply by the metric-scale translation magnitude.
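A minimal sketch of that conversion, assuming the notebook output is an inverse-depth array computed with unit translation and that the true translation magnitude is known from some external source (odometry, GPS, ground truth). The function name `inverse_depth_to_metric`, its parameters, and the sample values are hypothetical and do not come from the repository:

```python
import numpy as np

def inverse_depth_to_metric(inv_depth, translation_norm_m, eps=1e-8):
    """Convert up-to-scale inverse depth to metric depth.

    inv_depth          : inverse depth computed under the assumption |T| = 1
    translation_norm_m : true camera translation magnitude in meters (assumed known)
    eps                : lower clamp to avoid division by zero (hypothetical choice)
    """
    inv_depth = np.asarray(inv_depth, dtype=np.float64)
    # Step 1: invert "disparity" (really inverse depth) to get up-to-scale depth.
    depth_unit = 1.0 / np.maximum(inv_depth, eps)
    # Step 2: rescale by the metric translation magnitude.
    return depth_unit * translation_norm_m

# Made-up values for illustration only.
inv = np.array([0.5, 0.1, 0.02])
metric = inverse_depth_to_metric(inv, translation_norm_m=0.8)
print(metric)
```

Pixels with near-zero inverse depth map to very large depths, so in practice they may need to be masked rather than clamped.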

Yibin-Li commented 3 years ago

Thanks for the explanation! I am not sure what you mean by "first invert disparity to depth". Theoretically, I could just plug the disparity and the camera projection matrix into the triangulation to get metric depth, but in your notebook you plug in the optical flow instead of the disparity?

A broader question: for your triangulation and time-to-contact methods, if I plug in the ground-truth R and T, I should be able to obtain metric depth. Is that correct?

gengshan-y commented 3 years ago

To get depth from a pair of flow correspondences (or up-to-scale 3D flow), you need to "triangulate" given the camera matrices. This is done in blocks 27 and 28 of the notebook. I then take the inverse of the triangulated depth and call it "disparity", which is actually inverse depth.

In the above computations, the reference-frame camera is in the canonical pose, and the second-frame camera translation is assumed to have a magnitude of one. If you plug in the ground-truth R and T, you should be able to get metric depth from triangulation.
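The effect of the unit-translation assumption can be illustrated with a small synthetic example. This is a generic linear (DLT) triangulation written for illustration, not the code from the notebook; the intrinsics `K`, the pose, and the 3D point are made-up values, and the convention assumed is that a point `X` in the reference frame maps to `R @ X + T` in the second frame:

```python
import numpy as np

def project(X, K, R, T):
    """Project a 3D point given in the reference frame into a camera [R|T]."""
    x = K @ (R @ X + T)
    return x[:2] / x[2]

def triangulate_depth(x0, x1, K, R, T):
    """Linear (DLT) triangulation of one correspondence.

    The reference camera is in the canonical pose [I|0]; the second camera is [R|T].
    Returns the depth (z coordinate) of the point in the reference frame.
    """
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, T.reshape(3, 1)])
    A = np.stack([
        x0[0] * P0[2] - P0[0],
        x0[1] * P0[2] - P0[1],
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
    ])
    # The homogeneous solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return (X_h[:3] / X_h[3])[2]

# Synthetic setup (all values are illustrative assumptions).
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
T_metric = np.array([0.3, 0.0, -0.4])   # |T| = 0.5 m
X_true = np.array([1.0, -0.5, 10.0])    # point 10 m in front of the reference camera

x0 = project(X_true, K, np.eye(3), np.zeros(3))
x1 = project(X_true, K, R, T_metric)

depth_metric = triangulate_depth(x0, x1, K, R, T_metric)
T_unit = T_metric / np.linalg.norm(T_metric)
depth_unit = triangulate_depth(x0, x1, K, R, T_unit)

print(depth_metric, depth_unit, depth_unit * np.linalg.norm(T_metric))
```

With the metric translation the triangulated depth matches the true depth; with the unit-norm translation the depth comes out scaled by 1/|T|, so multiplying by the true |T| recovers the metric value.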

Yibin-Li commented 3 years ago

Thanks! That makes sense for the triangulation method.

As for your time-to-contact method, does it correspond to block 28 in the notebook? I am not sure the up-to-scale 3D flow matches the formula below, because it doesn't take the translation vector into account. If block 28 corresponds to the time-to-contact method, which variable corresponds to t_cz in the image below? Or does it refer to something else? I am trying to see where I should plug the ground-truth R and T into the time-to-contact method.

gengshan-y commented 3 years ago

You are right. Block 28 corresponds to another formula (not the one listed). The idea is that, instead of triangulating 2D flow alone, we "triangulate" 2D flow plus expansion, which gives more information near the epipole than triangulation and is more accurate than depth from TTC.

Hope the comments in the notebook make sense. There is more information in Eq. (3) of this paper.

Yibin-Li commented 3 years ago

Got it! Eq. (3) in your "Learning to Segment Rigid Motions from Two Frames" is really helpful here. Thanks for your detailed response!