DurrLab / C3VD

Colonoscopy 3D Video Dataset (C3VD), acquired with a high-definition clinical colonoscope and high-fidelity colon models, for benchmarking computer vision methods in colonoscopy.
https://durrlab.github.io/C3VD/

pose.txt coordinate system convention #3

Closed: dimitrisPs closed this issue 1 year ago

dimitrisPs commented 1 year ago

Hello,

Huge thanks for sharing this great work with the community.

I was wondering if you could provide more information about the homogeneous transforms in the pose.txt files.

Specifically:

- Which coordinate system conventions do the images and cameras follow?
- Do the matrices in pose.txt map points from the world frame to the camera frame, or from the camera frame to the world frame?

tbobrow1 commented 1 year ago

@dimitrisPs I am glad that you have found the dataset useful. I'm happy to provide additional information about the coordinate systems used in the dataset.

- Image coordinate system: The origin is at the top left corner of the image. Axis "u" points to the right, and axis "v" points down.
- Camera coordinate system: Follows the OpenCV convention. The origin is at the optical center of the camera. Axis "x" points to the right, axis "y" points down, and axis "z" points outward.

Fig. 3 in the paper provides a visual depiction of the coordinate systems.
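To make those conventions concrete, here is a minimal sketch of back-projecting a pixel to a camera-frame ray under the OpenCV convention. The intrinsic matrix K and the function name below are placeholders for illustration, not the dataset's calibration or code:

```python
import numpy as np

# Placeholder pinhole intrinsics (fx, fy, cx, cy); NOT the dataset calibration.
K = np.array([[760.0,   0.0, 680.0],
              [  0.0, 760.0, 540.0],
              [  0.0,   0.0,   1.0]])

def pixel_to_camera_ray(u, v):
    """Back-project pixel (u, v) (u right, v down, origin at top left) to a
    unit ray in the OpenCV camera frame: x right, y down, z outward."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)
```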

Regarding the matrices in the pose.txt files, these should be the transforms from the world coordinate system to the camera coordinate system. When the ground truth frames were rendered, the camera origin (O in the paper) was set to the translation component of the transform. Each camera ray (V in the paper) was rotated by the rotation component of the transform and cast into space from the camera origin.
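A small sketch of that rendering setup, assuming a 4x4 matrix T from pose.txt with rotation R and translation t; the function and variable names here are illustrative, not taken from the rendering code:

```python
import numpy as np

def cast_ray(T, v_cam):
    """Cast one camera ray into world space, per the description above.

    T: 4x4 transform from pose.txt; v_cam: ray direction in the camera
    frame (the paper's V). Returns the world-space origin and direction.
    """
    R, t = T[:3, :3], T[:3, 3]
    origin = t              # camera origin (the paper's O) = translation part
    direction = R @ v_cam   # camera ray rotated by the rotation component
    return origin, direction
```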

Let me know if you have any additional questions or if I can provide any additional clarification.

dimitrisPs commented 1 year ago

Thanks for the quick and detailed reply @tbobrow1. I was trying to reconstruct the d4v2 sequence in 3D using the depth maps and their corresponding camera poses.

Based on the results I am getting, the camera coordinate system is indeed the same as in OpenCV. As for the transforms, maybe I am confusing notation, but I was only able to align the point clouds generated from each depth map by multiplying them directly with their corresponding pose from pose.txt:

```python
# 4x4 pose matrix times 4xHW homogeneous points; @ is matrix multiplication
world_ptc_i_h = posetxt_i @ depthmap_ptc_i_h
```

This is not intuitive to me: if the contents of pose.txt describe world-to-camera transforms, I would have thought that moving points from the coordinate frame of each camera to the world frame requires multiplication with the inverse of the world-to-camera matrices.
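For reference, a toy sketch contrasting the two conventions; T and the points below are placeholders:

```python
import numpy as np

T = np.eye(4)                 # placeholder 4x4 pose from pose.txt
cam_pts_h = np.ones((4, 10))  # placeholder homogeneous camera-frame points

# If T is camera-to-world, camera points map to world points directly;
# if T is world-to-camera, the inverse is needed instead.
world_if_cam_to_world = T @ cam_pts_h
world_if_world_to_cam = np.linalg.inv(T) @ cam_pts_h
```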

tbobrow1 commented 1 year ago

@dimitrisPs, not a problem! To confirm your workflow: are you first back-projecting the depth frames from 2D to 3D using the camera intrinsics, then transforming the 3D points using the transform from pose.txt? If so, are you using the pinhole intrinsics? I will try to reproduce your result tonight to help diagnose the issue.

dimitrisPs commented 1 year ago

Hi again,

Yes, this is exactly what I am doing. I am using OpenCV and the fisheye calibration parameters I generated from the calibration samples provided as part of the dataset. I am pasting the model I got out of the calibration procedure below:

```
focal length x = 767.3861511125845
focal length y = 767.5058656118406
center x = 679.054265997005
center y = 543.646891684636
k1 = -0.18867185058223412
k2 = -0.003927337093919806
k3 = 0.030524814153620117
k4 = -0.012756926010904904
```

I am constructing the depthmap_ptc_i_h from my previous message as follows:

1) I compute the image plane points of the fisheye camera using cv2.fisheye.undistortPoints() and the OpenCV fisheye camera model.
2) I convert the image plane points to homogeneous coordinates by appending a 1 to each.
3) I multiply the homogeneous image plane points by the depth map values, which gives me the point cloud of the image.
4) Finally, depthmap_ptc_i_h is the point cloud from the previous step, appended with a row of ones to convert it to homogeneous coordinates. This allows me to do matrix multiplications with the 4x4 homogeneous transform matrices.
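Concretely, a minimal sketch of these four steps using cv2.fisheye.undistortPoints with the calibration posted above; the function name and array handling are my own illustration, untested against the actual dataset files:

```python
import numpy as np
import cv2

# Fisheye intrinsics from the calibration posted above.
K = np.array([[767.3861511125845, 0.0, 679.054265997005],
              [0.0, 767.5058656118406, 543.646891684636],
              [0.0, 0.0, 1.0]])
D = np.array([-0.18867185058223412, -0.003927337093919806,
              0.030524814153620117, -0.012756926010904904])

def depth_to_world(depth, pose):
    """Back-project an HxW depth map and move it to world coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u.ravel(), v.ravel()], axis=-1).reshape(-1, 1, 2)

    # Step 1: image plane (normalized) points of the fisheye camera.
    norm = cv2.fisheye.undistortPoints(pix, K, D).reshape(-1, 2)

    # Step 2: homogeneous image plane points (append a 1).
    rays = np.concatenate([norm, np.ones((norm.shape[0], 1))], axis=1)

    # Step 3: scale by the depth values to get camera-frame points (3xHW).
    cam_pts = rays.T * depth.ravel()

    # Step 4: append a row of ones, then apply the 4x4 pose matrix.
    cam_pts_h = np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])
    return (pose @ cam_pts_h)[:3].T
```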

If you go through the same steps and reconstruct points for all frames using the OpenCV function, there may be a few outliers originating from pixels in the periphery of the depth maps. These are probably due to using the OpenCV camera model instead of the Scaramuzza model, or to small differences in the calibration procedure in general, but such outliers can easily be filtered out.

Apart from that, if I multiply the homogeneous point clouds with the poses from pose.txt, everything lines up perfectly in 3D, which to me shows that the depth and pose information in the dataset is very accurate.

tbobrow1 commented 1 year ago

@dimitrisPs Thanks for summarizing your workflow. I am happy that the data works with the OpenCV camera model. We also noticed outliers towards the periphery of the FoV when using a pinhole model with radial distortion coefficients. I've found the Scaramuzza model to have higher accuracy across the full FoV, but it is more difficult to implement because of the non-linear forward projection.
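To illustrate why that forward projection is non-linear, here is a rough sketch of a Scaramuzza-style projection. The coefficients a0..a4 are placeholders, and the sign and ordering conventions follow the common OCamCalib formulation, which may differ from the exact model used for C3VD:

```python
import numpy as np

def scaramuzza_project(X, Y, Z, a):
    """Forward-project one camera-frame point under a Scaramuzza-style model.

    a = [a0, a1, a2, a3, a4] defines the back-projection polynomial
    f(rho) = a0 + a1*rho + a2*rho^2 + a3*rho^3 + a4*rho^4, where a sensor
    point (u, v) maps to the ray (u, v, f(rho)) with rho = |(u, v)|.
    Forward projection must solve f(rho) = rho * Z / m for rho, which is
    the non-linear step that makes this model harder to implement.
    """
    m = np.hypot(X, Y)
    # Polynomial a4*r^4 + a3*r^3 + a2*r^2 + (a1 - Z/m)*r + a0 = 0
    coeffs = [a[4], a[3], a[2], a[1] - Z / m, a[0]]
    roots = np.roots(coeffs)
    # Keep the smallest positive real root (a sketch; no error handling).
    rho = min(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0)
    return (X / m) * rho, (Y / m) * rho   # sensor-plane coordinates
```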

I just ran a quick check on my end, and the matrices in pose.txt are camera-to-world transformation matrices. This can be confirmed by back-projecting a depth map, transforming it with its matrix from pose.txt, and overlaying the resulting point cloud on the ground truth mesh in outputMesh.obj (which is saved in the world coordinate system). So your workflow should give you a point cloud in world coordinates. I'll add a note to the web page specifying that these are camera-to-world matrices.
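For anyone wanting to reproduce that check, a sketch using Open3D (my choice of viewer, an assumption rather than what was used here), with placeholder points standing in for a back-projected depth frame:

```python
import numpy as np
import open3d as o3d

# Ground truth mesh, saved in the world coordinate system.
mesh = o3d.io.read_triangle_mesh("outputMesh.obj")

# Point cloud back-projected from a depth frame and transformed with its
# camera-to-world matrix from pose.txt (placeholder data here).
world_pts = np.random.rand(1000, 3)
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(world_pts)

# If pose.txt were world-to-camera instead, the cloud would sit off the mesh.
o3d.visualization.draw_geometries([mesh, pcd])
```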

Thanks for checking in about this!

dimitrisPs commented 1 year ago

Thanks, @tbobrow1, for checking everything and confirming the format of the dataset. And once again, thanks for making your work available to everyone :)