TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License

3D reconstruction from multiple images - scale mismatch in depth and pose #213

Open alexander-born opened 2 years ago

alexander-born commented 2 years ago

Thanks for this great repository!

I tried to modify the camviz demo to create a point cloud not only from a single image but from multiple images.

First I downloaded the pretrained KITTI checkpoints (PackNet01_HR_velsup_CStoK.ckpt).

I modified ./scripts/infer.py to also compute the pose, like this:

# pose inference: relative poses from the target image to its two context images
pose = model_wrapper.pose(image, [image_tminus1, image_tplus1])

and saved these relative poses (t -> t_minus_1), in addition to the depth, in an npz file.
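Roughly like this (a minimal sketch; the array names depth and pose are my own choice, not part of infer.py, and I assume the outputs are torch tensors from the inference step):

import numpy as np

# sketch: persist the depth map and the 6-vector pose (translation + Euler
# angles, t -> t-1) for one frame; key names are my own convention
np.savez("frame_0001.npz",
         depth=depth.squeeze().cpu().numpy(),      # [H, W] predicted depth
         pose=pose_vec.squeeze().cpu().numpy())    # [6] pose net output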

To create a Pose object from the pose net output I created a transformation matrix with:

import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_vec2mat(posenet_output):
    """Convert a 6-vector (translation + Euler angles) from the pose net
    into a 4x4 transformation matrix (pose from t -> t_minus_1)."""
    trans, rot = posenet_output[:3], posenet_output[3:]
    rot_mat = R.from_euler("zyx", rot).as_matrix()                 # [3, 3]
    mat = np.concatenate((rot_mat, trans[:, np.newaxis]), axis=1)  # [3, 4]
    padding = np.array([0, 0, 0, 1])
    mat = np.concatenate((mat, padding[np.newaxis, :]), axis=0)    # [4, 4]
    return mat

I accumulated these poses with __matmul__ to get all the camera poses, as in the sketch below. Is the pose calculation correct? (It looks good when visualized.)
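A minimal sketch of what I mean by accumulating, assuming each matrix maps points from frame t into frame t-1 and frame 0 is taken as the world origin:

import numpy as np

def accumulate_poses(relative_poses):
    """Chain relative transforms T(t -> t-1) into camera-to-world poses,
    with frame 0 as the world origin."""
    pose = np.eye(4)
    world_poses = [pose.copy()]
    for rel in relative_poses:
        # T(t -> 0) = T(t-1 -> 0) @ T(t -> t-1)
        pose = pose @ rel
        world_poses.append(pose.copy())
    return world_poses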

Then I used Camera.i2w() to project the point clouds of multiple images into world coordinates. (Additionally, I filtered each point cloud by a maximum distance threshold.)
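The distance filter is just a per-point threshold (a minimal sketch; max_dist and the variable names are my own):

import numpy as np

def filter_by_distance(points, max_dist=30.0):
    """Keep only points within max_dist of the camera origin.
    points: [N, 3] array in the camera frame, before projecting to world."""
    return points[np.linalg.norm(points, axis=1) < max_dist]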

It seems like there is a scale mismatch between the outputs of the depth network and the pose network. This can be seen in the screenshot below, where I am visualizing point clouds from multiple images (KITTI Tiny). The coordinate systems in the screenshot are the camera poses. You can see the scale mismatch in the duplicated vehicle, and in how far the camera poses moved compared to the point cloud. Shouldn't these point clouds from multiple images fit together really well when the pose net and depth net were trained together?

[screenshot]

The resulting point clouds only overlap (and not perfectly) if I scale either the depth or the poses (pose scaled by a factor of 0.14, no scaling factor on depth):

[screenshot]

Another KITTI example (pose scaled by a factor of 0.14, no scaling factor on depth):

[screenshot]
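For reference, I apply the scale only to the translation part of the transform, since rotation carries no scale (a minimal sketch; the 0.14 factor is purely empirical from my experiments):

def scale_pose(mat, scale=0.14):
    """Scale the translation of a 4x4 transform; the rotation is left as-is."""
    scaled = mat.copy()
    scaled[:3, 3] *= scale
    return scaled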

hhhharold commented 2 years ago

Have you tried 3D reconstruction from multi-view images, like the one shown in the camviz README?

alexander-born commented 2 years ago

Yes, this is a 3D reconstruction from multiple views in a global frame. I generated one joint 3D point cloud from multiple images and projected them into the global frame via the relative poses between the images.

The camviz README GIFs show something else: there, the point clouds from single images are shown one after another in the camera frame, not in a global frame (only the output of the depth net is used; the pose net outputs are not used there). That's how I understood it; please correct me if this is wrong.

The problem I am facing is that the scale of the pose net output does not match the scale of the depth net output (the projections in the global frame do not overlap).

hhhharold commented 2 years ago

I used multi-view images from the DDAD dataset to reconstruct the 3D scene in world coordinates, but the result was not good. I am checking whether there is an error in my code or whether the extrinsics are inaccurate.

VitorGuizilini-TRI commented 2 years ago

Can you try using ground-truth information, just to check whether the transformations are correct?

hhhharold commented 2 years ago

> Can you try using ground-truth information, just to check whether the transformations are correct?

Yes, there were some mistakes in my code, and the extrinsics are correct. By the way, how do I set the pose parameter of the draw.add3Dworld function to get a better initial viewing angle in world coordinates? The default setting in the demo is draw.add3Dworld('wld', luwh=(0.33, 0.00, 1.00, 1.00), pose=(7.25323, -3.80291, -5.89996, 0.98435, 0.07935, 0.15674, 0.01431)).

sctrueew commented 2 days ago

Hi everyone, I have a video like this, and I've extracted depth and converted each frame to a point cloud. I would like to display it like this (SLAM-style). Could you please guide me?

[screenshot]