Brummi / MonoRec

Official implementation of the paper: MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera (CVPR 2021)

DVSO keypoints as metric depth? #13

Closed: morsingher closed this issue 3 years ago

morsingher commented 3 years ago

Hi, thanks for sharing this amazing work.

I have a question about the DVSO keypoints you provide in the README. As far as I understand, they actually contain disparity values, which you convert to depth as follows:

https://github.com/Brummi/MonoRec/blob/671fec524fec0b21720d098f14ee18faab9d291d/data_loader/kitti_odometry_dataset.py#L162

The conversion depth = (baseline * focal) / disparity makes sense to me, but I don't really understand how the width and the factor of 65535 enter this formula. Could you clarify how to obtain metric depth values from these keypoints?

Thank you in advance.

nynyg commented 3 years ago

Hi @morsingher, thanks for your interest in our work.

Sorry for the confusion about the conversion. The result of this formula is the inverse of the metric depth, so to get metric depth you just need to invert it.

As for how the values are stored: as you said, they are indeed disparities, but normalized by the width of the image, so

norm_disp = f_x * b / (d * width),

where d is the metric depth (and f_x * b / d is the disparity in pixels).

This way, norm_disp is always in [0, 1] and invariant to the image width. Then, since we store the depth maps as 16-bit PNG files, we further multiply norm_disp by 65535 to save the values as integers, so

stored_value = 65535 * norm_disp.
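
In code, the encoding amounts to roughly the following sketch (the calibration values, image size, and input depth map below are placeholders for illustration, not our exact pipeline):

```python
import numpy as np
from PIL import Image

# Placeholder calibration: KITTI-like focal length (px) and stereo baseline (m).
f_x, baseline = 718.856, 0.54

# Hypothetical metric depth map d (m); in practice only sparse DVSO keypoints
# carry values and the rest of the image would be zero/invalid.
d = np.full((370, 1226), 20.0)
width = d.shape[1]

# norm_disp = f_x * b / (d * width): the disparity in pixels, normalized by width.
norm_disp = f_x * baseline / (d * width)

# stored_value = 65535 * norm_disp, rounded to 16-bit integers.
stored = np.round(65535 * norm_disp).astype(np.uint16)

# Write as a 16-bit grayscale PNG.
Image.fromarray(stored).save("dvso_keypoints.png")
```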

Hope the above explanation helps. Please let us know if you have further questions.

morsingher commented 3 years ago

Hi @Yelen719,

thanks for the quick answer. So, just to sum up, I should load the PNG with PIL and perform the following steps (sketched in code after the list):

  1. norm_disp = stored_value / 65535, which correctly gives values between 0 and 1.
  2. disp = norm_disp * w, which gives the disparity measured in pixels.
  3. depth = (f_x * b) / disp, which should be the metric depth.
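
In code, a minimal sketch of these three steps (the calibration values and file name are placeholders):

```python
import numpy as np
from PIL import Image

# Load the 16-bit PNG; pixel values are 65535 * norm_disp.
stored = np.array(Image.open("dvso_keypoints.png"), dtype=np.float64)
h, w = stored.shape

# Placeholder calibration (in practice, read from the KITTI calibration files).
f_x, baseline = 718.856, 0.54

# 1. norm_disp = stored_value / 65535, in [0, 1].
norm_disp = stored / 65535.0

# 2. disp = norm_disp * w, the disparity in pixels.
disp = norm_disp * w

# 3. depth = (f_x * b) / disp; guard against zero (invalid) disparities.
valid = disp > 0
depth = np.zeros_like(disp)
depth[valid] = (f_x * baseline) / disp[valid]
```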

It seems fine now, thanks. The only issue I'm facing is that some of these keypoints (around 15-20% per image on KITTI sequence 04) have depth values much greater than 80 m. Of course I can just ignore them, but I'm wondering whether you also use those during training?

nynyg commented 3 years ago

Yes, we also use them for training.

nynyg commented 3 years ago

I will close this issue for now. If you have further questions, feel free to reach out to us.