HKUST-Aerial-Robotics / MVDepthNet

This repository provides the PyTorch implementation of the 3DV 2018 paper "MVDepthNet: real-time multiview depth estimation neural network"
GNU General Public License v3.0

Question about camera pose and camera intrinsic #2

Closed Smallha61109 closed 5 years ago

Smallha61109 commented 5 years ago

Hello, you have a great project!

I have a few questions about using it:

  1. Should the camera motion provided to the code be left_image_pose (reference image) - right_image_pose (measurement image), or the other direction?
  2. Do the camera intrinsics change with image resolution? My camera is calibrated at a resolution of 640x480; if I want to use your network without retraining it, I have to use 320x256. Do I need to re-calibrate my camera at that resolution?

Thank you very much!

WANG-KX commented 5 years ago

Hello, thanks for your interest in our work.

  1. The pose this_sample['left2right'] is the SE(3) pose of the left camera expressed in the right camera's coordinates. If you have the left camera pose T_left and the right camera pose T_right, then left2right = T_right.inverse() * T_left. The left image is the reference image and the right image is the measurement image.
  2. The camera intrinsic parameters change according to how you scale the image. Ignoring distortion, if your camera parameters are fx, fy, cx, and cy at 640x480, then at 320x256 they become fx x 0.5, fy x 0.533, cx x 0.5, and cy x 0.533.

Regards, Kaixuan
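For illustration, a minimal sketch of both points (not from the repository; it assumes the poses are camera-to-world 4x4 NumPy matrices and uses placeholder intrinsic values):

```python
import numpy as np

# Hypothetical camera-to-world poses of the reference (left) and
# measurement (right) frames as 4x4 homogeneous matrices.
T_left = np.eye(4)
T_right = np.eye(4)
T_right[0, 3] = 0.10  # e.g. a 10 cm baseline along x

# left2right: pose of the left camera expressed in right-camera coordinates,
# i.e. T_right.inverse() * T_left.
left2right = np.linalg.inv(T_right) @ T_left

# Intrinsics calibrated at 640x480, rescaled for the 320x256 network input.
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5    # placeholder values
sx, sy = 320.0 / 640.0, 256.0 / 480.0          # 0.5 and ~0.533
K_scaled = np.array([[fx * sx, 0.0,     cx * sx],
                     [0.0,     fy * sy, cy * sy],
                     [0.0,     0.0,     1.0    ]])
```
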
Smallha61109 commented 5 years ago

Thank you very much for your reply!

Smallha61109 commented 5 years ago

Hello, I have two more questions here:

  1. How do I scale the output back to a normal depth map (in mm)? I think the output is normalized to 0~2 (or -1~1) as inverse depth?
  2. How do you obtain the camera pose for the input? Whether I use ORB-SLAM's trajectory or the ground truth from the TUM dataset, I can't get a good result, so I'm not sure if I am doing anything wrong here.

Thank you very much.

WANG-KX commented 5 years ago

Dear,

  1. The output depth is the inverse depth. Invert the output and you will get a depth map in metric scale.
  2. Be careful when using ORB-SLAM, because the relative pose here must be in metric scale, and the monocular mode of ORB-SLAM cannot recover the real-world scale. In the example data, the camera pose is given by the dataset; TUM is good to use. Later this week, I will upload an example that uses an online pose estimator such as VINS-Mono to get real-time depth estimation. I think that if you are doing things right you can get reasonable results.

Regards,

Kaixuan

Smallha61109 commented 5 years ago

Hello,

Sorry, but I don't quite understand what a depth map in metric scale is. If I use the example model with the example code, what will the scale be?

Thank you for your quick reply.

WANG-KX commented 5 years ago

Dear,

Assume that for pixel x we get the estimate z. The depth of pixel x is d = 1/z, in meters.

Kaixuan

Smallha61109 commented 5 years ago

Hello, I just want to make sure that in the example code, the output I should be working with is "idepth", right? If I want to save a non-inverse depth map in mm, I should do: save_depth = (1/idepth)*1000

WANG-KX commented 5 years ago

Dear,

I think that is the right way to do it.

Regards, Kaixuan
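
For illustration, a minimal sketch of that conversion (assuming idepth is a NumPy array of inverse depths in 1/meters and that OpenCV is available to write a 16-bit PNG):

```python
import numpy as np
import cv2

def save_depth_mm(idepth, path="depth_mm.png"):
    """Convert an inverse-depth map (1/meters) to a 16-bit PNG in millimetres."""
    idepth = np.asarray(idepth, dtype=np.float32)
    depth_m = np.zeros_like(idepth)
    valid = idepth > 1e-6                    # avoid division by (near) zero
    depth_m[valid] = 1.0 / idepth[valid]     # meters
    depth_mm = np.clip(depth_m * 1000.0, 0, 65535).astype(np.uint16)
    cv2.imwrite(path, depth_mm)
    return depth_mm
```
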

Smallha61109 commented 5 years ago

Hello, since the output is confirmed, I went back to check my input, but I can't see any problem. I used the TUM fr1_xyz dataset as input, taking the first frame as the reference image and the 4th frame as the measurement image. Their ground-truth poses (translation followed by the orientation quaternion) are:

First frame: 1.3405 0.6266 1.6575 0.6574 0.6126 -0.2949 -0.3248
4th frame: 1.3066 0.6256 1.6196 0.6621 0.6205 -0.2892 -0.3050

These are then converted to pose matrices:

First frame:
[[ 0.07551046  0.61387944 -0.78567948  1.3405    ]
 [ 0.99701352 -0.03828154  0.06573556  0.6266    ]
 [ 0.01021044 -0.78835852 -0.61490704  1.6575    ]
 [ 0.          0.          0.          1.        ]]

4th frame:
[[ 0.06268622  0.6452541  -0.76146364  1.3066    ]
 [ 0.9980781  -0.0440261   0.0449838   0.6256    ]
 [-0.00445364 -0.7627782  -0.64679332  1.6196    ]
 [ 0.          0.          0.          1.        ]]

The 'left2right' will be pose4.inv()*pose1, therefore:

[[ 4.74018205e-03  6.12628855e-01  3.53346909e-03 -9.37094866e-01]
 [ 6.43286514e-01  1.68159404e-03 -5.01337987e-02  2.62950782e-01]
 [-7.77369044e-03 -3.54228786e-02  3.97621668e-01  3.33813858e+00]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.00000000e+00]]

Does this procedure seem correct to you? The output doesn't look as good as your sample, and neither does the sc-inv error, so I'm not sure what the problem is. Thank you very much for your patient explanation.
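
For illustration, a minimal sketch of this procedure (assuming SciPy is available for the quaternion conversion and that the pose lines follow TUM's tx ty tz qx qy qz qw ordering):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def tum_to_matrix(tx, ty, tz, qx, qy, qz, qw):
    """Build a 4x4 camera-to-world pose from a TUM ground-truth entry."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_quat([qx, qy, qz, qw]).as_matrix()  # scalar-last
    T[:3, 3] = [tx, ty, tz]
    return T

# The poses quoted above (reference = first frame, measurement = 4th frame).
T_ref  = tum_to_matrix(1.3405, 0.6266, 1.6575, 0.6574, 0.6126, -0.2949, -0.3248)
T_meas = tum_to_matrix(1.3066, 0.6256, 1.6196, 0.6621, 0.6205, -0.2892, -0.3050)

# left2right per the formula above: T_right.inverse() * T_left (matrix product).
left2right = np.linalg.inv(T_meas) @ T_ref
```
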

WANG-KX commented 5 years ago

Dear,

Actually, fr1_xyz belongs to the training set of the network. I checked the dataset: why is the pose of the 4th frame (stamp 1305031102.275326) not 1305031102.2758 1.3160 0.6254 1.6302 0.6609 0.6199 -0.2893 -0.3086? The pose you gave is 1305031102.3158 1.3066 0.6256 1.6196 0.6621 0.6205 -0.2892 -0.3050, right?

Also, have you normalized the image before the input?

Regards, Kaixuan

Smallha61109 commented 5 years ago

Hello, thank you for pointing out the timestamp association issue; it was caused by a bug in my association code. However, after fixing it the output doesn't seem to change much. The input is normalized following the suggestion in the README: 81 is subtracted from each RGB image and the result is divided by 35.
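
For illustration, a minimal sketch of that normalization (assuming the image is loaded with OpenCV; the mean 81 and scale 35 are the values quoted from the README, and the 320x256 size matches the discussion above):

```python
import cv2
import numpy as np
import torch

def load_normalized(path):
    """Load an image, resize it to the 320x256 network input,
    and normalize each pixel as (value - 81) / 35."""
    img = cv2.imread(path)                        # HxWx3, uint8
    img = cv2.resize(img, (320, 256))             # OpenCV takes (width, height)
    img = (img.astype(np.float32) - 81.0) / 35.0
    # HWC -> 1xCxHxW float tensor for the PyTorch network.
    return torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)
```
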

WANG-KX commented 5 years ago

Dear,

I uploaded example2.py to show how to process your own data. As far as I can see, the two images are too close to each other, so there is not enough translation between them.
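
A quick sanity check for this (a sketch, not taken from example2.py): the norm of the translation part of left2right is the distance between the two camera centres in meters.

```python
import numpy as np

def baseline_m(left2right):
    """Distance between the two camera centres, in meters,
    given the 4x4 left2right relative pose."""
    return float(np.linalg.norm(left2right[:3, 3]))

# A baseline of only a few centimetres gives the network little parallax;
# choosing a measurement frame further from the reference increases it.
```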

Regards, Kaixuan