EPFL-VILAB / omnidata

A Scalable Pipeline for Making Steerable Multi-Task Mid-Level Vision Datasets from 3D Scans [ICCV 2021]

Camera intrinsics #29

Closed VitorGuizilini-TRI closed 1 year ago

VitorGuizilini-TRI commented 1 year ago

Hi, thanks for the great work! Can you provide some more information about how to parse the point_info parameters into camera intrinsics and extrinsics?

alexsax commented 1 year ago

Hi! Good question!

I just pushed a commit today that contains multiview dataloaders and a notebook showing how to unproject multiple depth images into the same pointcloud (using the R/T/K matrices from point_info). Here's a pic of what's in the notebook:

[image]
alexsax commented 1 year ago

If this doesn't work, please let me know + reopen!

VitorGuizilini-TRI commented 1 year ago

Thank you, that seems very useful! Unfortunately it seems like it uses depth_euclidean, and I'm trying to use depth_zbuffer in my implementation. There is probably an easy way to convert; I'll dig deeper!

But in summary, what I am looking for are the traditional pinhole camera intrinsics [[fx,0,cx],[0,fy,cy],[0,0,1]], and your code seems to do everything in NDC, which has a different convention. Do you know how to convert the information from point_info into these values? Then I can provide the zbuffer depth maps + [R|t] and get the 3D points directly.

alexsax commented 1 year ago

The point_info section contains proj_K[:3,:3], which is what you are looking for in NDC space. It works with depth_zbuffer too: you can use proj_K and zbuffer depth in pytorch3d.cameras.PerspectiveCameras.unproject_points. However, it won't work for Hypersim, because Hypersim doesn't use pinhole cameras (explanation below).

As for the different conventions between NDC and the pinhole model you described: I believe the pinhole model you described is in screen space. It should be simple enough to convert between the two. This page describes how to do the conversion, and I'm reproducing it here:

The relationship between screen and NDC specifications of a camera's focal_length and principal_point is given by the following equations, where s = min(image_width, image_height). The transformation of x and y coordinates between screen and NDC is exactly the same as for px and py.

fx_ndc = fx_screen * 2.0 / s
fy_ndc = fy_screen * 2.0 / s

px_ndc = - (px_screen - image_width / 2.0) * 2.0 / s
py_ndc = - (py_screen - image_height / 2.0) * 2.0 / s
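
In code, the conversion between the two looks something like this (a minimal sketch of the relations above, not code from the repo; image_width / image_height are whatever your renders use):

def screen_to_ndc(fx_s, fy_s, px_s, py_s, image_width, image_height):
    # s is the length of the shorter image side, per the convention above
    s = min(image_width, image_height)
    return (fx_s * 2.0 / s,
            fy_s * 2.0 / s,
            -(px_s - image_width / 2.0) * 2.0 / s,
            -(py_s - image_height / 2.0) * 2.0 / s)

def ndc_to_screen(fx_n, fy_n, px_n, py_n, image_width, image_height):
    # inverse of the relations above
    s = min(image_width, image_height)
    return (fx_n * s / 2.0,
            fy_n * s / 2.0,
            image_width / 2.0 - px_n * s / 2.0,
            image_height / 2.0 - py_n * s / 2.0)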

The reason I use depth_euclidean in the code is that the P3D PerspectiveCamera class is a straight pinhole model, but the Hypersim cameras have lens tilt + shift parameters, so the bottom row of the matrix is not [0,0,1]. I wish Hypersim used pinhole cameras, but the views are artist-generated and the artists added the tilt/shift. I added a GenericPinholeCamera class that uses depth_euclidean to handle generic intrinsics matrices, but you could achieve the same functionality using depth_zbuffer as well.

VitorGuizilini-TRI commented 1 year ago

Thank you, that makes a lot of sense! One last question: on Taskonomy (I haven't checked the other datasets yet) I cannot find proj_K inside point_info, only field_of_views_rad. I am passing that parameter as input to FoVPerspectiveCameras from pytorch3d, which gives me the NDC intrinsics. Is that what I am supposed to do?

alexsax commented 1 year ago

Yes that should work! That's what I do in the dataloader to get proj_K (and other datasets inherit from this method).

But I'd suggest using/modifying the dataloader rather than trying to read the point_info itself from scratch, as it seems you're doing. The contents can differ between datasets, and we spent some time getting the dataloader to work.

VitorGuizilini-TRI commented 1 year ago

Thank you! I managed to get the intrinsics working; now I can unproject with z_buffer and it gives me exactly the same results. I am still having issues with the extrinsics: it seems like my codebase has a different convention for frames of reference, but I will keep digging (unless you have any tips for that too).

However, for my application I need the proper camera intrinsics, so are you saying that I shouldn't use Hypersim? Are the other datasets alright for that (i.e., can I use the converted intrinsics to unproject z_buffer depth)?

alexsax commented 1 year ago

Great! The pytorch3d convention is cam_to_world for the [R | t] matrix, so that's what the dataloader returns. But convention issues are always tricky to debug.

And the intrinsics for Hypersim are correct, the K matrix will be exact. It just isn't a pinhole model. Pinhole has 4 degrees of freedom, while Hypersim has 6 (pinhole (4) + tilt (1) + shift (1)).

VitorGuizilini-TRI commented 1 year ago

Ok, I see! So the only non-pinhole dataset is Hypersim, right?

alexsax commented 1 year ago

That's right: all the other datasets (Taskonomy, Replica, GSO-in-Replica, HM3D, and BlendedMVS) use pinhole cameras.

wangjiongw commented 1 year ago

Hi, thanks for your guidance. Now I understand that FoVPerspectiveCameras uses the FoV to get proj_K and unproject RGB-D to a point cloud. But I want to project the recovered point cloud (in camera coordinates) back to the RGB image via K in OpenCV/Open3D, which requires (fx, fy, cx, cy) in units of pixels. Can you give me some advice? Thanks in advance.

ZachL1 commented 1 year ago

@alexsax

Yes that should work! That's what I do in the dataloader to get proj_K (and other datasets inherit from this method).

Great! The pytorch3d convention is cam_to_world for the [R | t] matrix, so that's what the dataloader returns. But convention issues are always tricky to debug.

Hi! To avoid ambiguity, I think the R, t returned by this dataloader should be called world_to_cam, and they must be used following the PyTorch3D convention: $$X_{cam} = X_{world} R + t$$

If we want to left-multiply by the rotation matrix as usual, then the R returned by this dataloader should be called cam_to_world, and it must be transposed first: $$X_{cam} = R^T X_{world} + t$$ But either way, the returned t should definitely be called world_to_cam.
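
A quick sanity check of the two forms (plain NumPy; the R and t here are made up purely for illustration):

import numpy as np

R = np.array([[0., -1., 0.],     # an arbitrary rotation (90 degrees about z)
              [1., 0., 0.],
              [0., 0., 1.]])
t = np.array([1., 2., 3.])       # an arbitrary translation
X_world = np.array([4., 5., 6.])

# PyTorch3D-style right multiplication of a row vector ...
X_cam_right = X_world @ R + t
# ... equals the usual left multiplication of a column vector by R^T
X_cam_left = R.T @ X_world + t
assert np.allclose(X_cam_right, X_cam_left)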


As you say, it doesn't matter as long as we use the dataloader rather than trying to read point_info from scratch, because the dataloader actually inverts [R|t]: https://github.com/EPFL-VILAB/omnidata/blob/3242ba7e39efe49b3bf5dae46688702aeeb05f25/omnidata_tools/torch/dataloader/pytorch3d_utils.py#L148-L149 But we can do more interesting and customizable things by using the correct [R|t] directly.

Changing such ambiguous names throughout the project would be tedious and labor-intensive, but I think it's necessary to make this clear. It took me quite a while to debug this problem, so I'm explaining it here.

ZachL1 commented 1 year ago

@wangjiongw

Hi, thanks for your guidance. Now I understand that FoVPerspectiveCameras uses the FoV to get proj_K and unproject RGB-D to a point cloud. But I want to project the recovered point cloud (in camera coordinates) back to the RGB image via K in OpenCV/Open3D, which requires (fx, fy, cx, cy) in units of pixels. Can you give me some advice? Thanks in advance.

Hi, you can extract the intrinsics from the OpenGL projection matrix following this tutorial.

Here's how I did it and it worked for me:

# NDC projection matrix from the dataloader helper
P = _get_cam_to_world_R_T_K(point)['proj_K'].numpy()
w = h = 512  # omnidata images should all be 512x512
# convert the NDC focal lengths / principal point to pixel units
fx, fy, cx, cy = P[0,0]*w/2, P[1,1]*h/2, (w-P[0,2]*w)/2, (P[1,2]*h+h)/2
wangjiongw commented 1 year ago

@ZachL1 Thanks for your great help, that really helped a lot! I tried your method and found that P[0,2] and P[1,2] are zero, so the fx, fy, cx, cy in the code above assume the principal point lies at the center of the image, right? Also, I found that this intrinsic only works with depth_zbuffer but not depth_euclidean, and the reprojected points in camera coordinates don't pair with the R|T from the given dataloader, which is quite similar to what @VitorGuizilini-TRI ran into. Do you have any advice on this? By the way, thanks to you all for the great help! @alexsax @ZachL1 @VitorGuizilini-TRI

wangjiongw commented 1 year ago

Agreed. In the given dataloader, there is a comment showing that PyTorch3D requires world2cam extrinsic parameters. I think the keys in the returned parameter dictionary are typos.

ZachL1 commented 1 year ago

@wangjiongw Sorry for my late reply, I've been too busy lately.

I tried your method and found that P[0,2] and P[1,2] are zero, so the fx, fy, cx, cy in the code above assume the principal point lies at the center of the image, right?

The code snippet I gave doesn't assume anything; it simply converts the projection matrix $P$ to the usual intrinsic matrix $K$. For a pinhole camera, $P$ and $K$ should only differ in representation, and if the original $P$ assumes that the principal point is at the center of the image, then so does the converted $K$.

Also, I found that this intrinsic only works with depth_zbuffer but not depth_euclidean, and the reprojected points in camera coordinates don't pair with the R|T from the given dataloader, which is quite similar to what @VitorGuizilini-TRI ran into. Do you have any advice on this?

Yes, using $K$ and depth to unproject only works for depth_zbuffer. In my opinion, depth_euclidean is not what we usually think of as depth. As my first comment illustrates, to reproject points from the camera frame to the world frame using the [R|t] returned by the dataloader, you must understand the actual direction of the [R|t] and how you are applying the transformation to the point coordinates.
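
For reference, here is a minimal sketch (plain NumPy, not code from the repo) of what unprojecting depth_zbuffer with a pixel-space K = [[fx,0,cx],[0,fy,cy],[0,0,1]] means, plus a per-pixel conversion from depth_euclidean (distance along the viewing ray) to depth_zbuffer (distance along the optical axis):

import numpy as np

def unproject_zbuffer(depth_z, fx, fy, cx, cy):
    # depth_z: (H, W) z-buffer depth; returns (H*W, 3) points in camera coordinates
    H, W = depth_z.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) / fx * depth_z
    y = (v - cy) / fy * depth_z
    return np.stack([x, y, depth_z], axis=-1).reshape(-1, 3)

def euclidean_to_zbuffer(depth_e, fx, fy, cx, cy):
    # depth_euclidean is the distance along each pixel's viewing ray; dividing by
    # the length of the ray direction at unit z gives the z-buffer depth
    H, W = depth_e.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    ray_len = np.sqrt(((u - cx) / fx) ** 2 + ((v - cy) / fy) ** 2 + 1.0)
    return depth_e / ray_len

This assumes the usual pixel-space convention (u to the right, v down), which differs from the NDC camera used elsewhere in the repo.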

wangjiongw commented 1 year ago

@ZachL1 Thanks a lot! I will keep trying to figure it out. I think we have the same understanding of the intrinsic parameters and depth data. I will test the extrinsic parameters later.

wangjiongw commented 1 year ago

Following the previous advice from @ZachL1, I have unprojected RGB-D to a point cloud in the camera coordinate system, using depth_zbuffer as the depth images and the K obtained by the method mentioned in this issue. However, I found that the extrinsic parameters, following the transformation given in the dataloader, cannot transform point clouds of different frames into the correct position.

In other words, given data of the same point but different views, the point clouds should connect into one. However, what I obtained was two sub-parts of a point cloud overlapping incorrectly. In more detail, if I use the R returned by euler_angles_to_matrix directly, the result is:

[image]

If I use the transpose of R instead, the resulting point cloud looks like this:
[image]

But these two parts should match up and form a whole room.

As for T, the difference between the point clouds' coordinates is at least 100, but the translation from the given dataloader is around 10, which I think is not enough to bring the two parts together.

Overall, since I obtain the point cloud by unprojecting depth_zbuffer, the resulting coordinates in the camera system differ from those in the official dataloader, so I think the extrinsics from the given dataloader do not match the point cloud I now have. Could you please give some advice? @ZachL1 @alexsax @VitorGuizilini-TRI

I'd appreciate any suggestions a lot.

ZachL1 commented 2 months ago

Hi @wangjiongw! Sorry for taking so long to reply; I wasn't doing 3D-related research for a while and have only recently started working on it again.

However, I found that the extrinsic parameters, following the transformation given in the dataloader, cannot transform point clouds of different frames into the correct position.

I believe this is caused by a different definition of the coordinate system. In Omnidata/PyTorch3D the coordinate system is defined as x-left, y-up, z-forward, so you should be careful about how your own camera coordinate system is defined when performing camera-to-world transformations. Also, the R returned by the Omnidata dataloader should always be used with right multiplication.

It takes a while to figure it all out. I provide here an example of reading OpenCV-format R, T, K from Omnidata, with a successful back-projection to a point cloud; I hope it will be helpful to "OpenCV guys" like me: https://gist.github.com/ZachL1/a3d3227c76228d2a571c4d040886874e

The OpenCV format means:

  1. K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]],
  2. the coordinate system is defined as x-right, y-down, z-forward,
  3. and left multiplication is used,

that is, 3D world coordinates project to 2D homogeneous pixel coordinates via $$x' = K [R | t] X$$
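
To make that last formula concrete, here is a minimal NumPy sketch of projecting world points with OpenCV-style K, R, t (the names here are placeholders; see the gist above for the full Omnidata-specific extraction):

import numpy as np

def project_opencv(X_world, K, R, t):
    # X_world: (N, 3) world points; K: (3, 3); R: (3, 3) world-to-cam; t: (3,)
    X_cam = X_world @ R.T + t               # x_cam = R x_world + t, written for row-stacked points
    uv_hom = X_cam @ K.T                    # x' = K [R | t] X in homogeneous pixel coordinates
    uv = uv_hom[:, :2] / uv_hom[:, 2:3]     # perspective divide by depth
    return uv, X_cam[:, 2]                  # pixel coordinates and per-point depth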