YvanYin / Metric3D

The repo for "Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image" and "Metric3Dv2: A Versatile Monocular Geometric Foundation Model..."
https://jugghm.github.io/Metric3Dv2/
BSD 2-Clause "Simplified" License

What system coordinate are depth and normal in? #125

Closed HardikJain02 closed 4 months ago

HardikJain02 commented 4 months ago

To my understanding, the depth map is camera-centric. What about the normal map? Is it in the camera coordinate system or the world coordinate system?

If it is camera-centric, how can we convert it to the world coordinate system?

Context: I am interested in finding the 3D coordinates of bounding boxes and polygons in a given 2D RGB image. Any other way to find these coordinates would be fine as well.

@JUGGHM

JUGGHM commented 4 months ago

Thank you for your interest, but this is a monocular system: there is only one frame, so the output is naturally in the camera's coordinate system.

If we want to obtain the normals in world coordinates, we need a rigid transformation from camera to world.
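The camera-to-world conversion described above can be sketched as follows. Since normals are directions, only the rotation part of the rigid transformation applies (translation has no effect). The rotation `R_cam_to_world` here is a hypothetical example; in practice it comes from your camera extrinsics.

```python
import numpy as np

# Hypothetical camera-to-world rotation (a 30-degree rotation about the
# camera x-axis); in practice this comes from your camera extrinsics.
theta = np.deg2rad(30.0)
R_cam_to_world = np.array([
    [1.0, 0.0, 0.0],
    [0.0, np.cos(theta), -np.sin(theta)],
    [0.0, np.sin(theta),  np.cos(theta)],
])

# Normal map in camera coordinates, shape (H, W, 3), unit vectors.
# Here every pixel is an upward-facing floor normal as a toy example.
normals_cam = np.zeros((4, 4, 3))
normals_cam[..., 1] = -1.0

# Normals are directions: apply only the rotation, no translation.
# n @ R.T rotates every per-pixel vector by R in one batched product.
normals_world = normals_cam @ R_cam_to_world.T

# A rigid rotation preserves unit length.
assert np.allclose(np.linalg.norm(normals_world, axis=-1), 1.0)
```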

ani0075saha commented 2 weeks ago

Hi @JUGGHM, I am trying to understand the correspondence of RGB values in the surface normal output image to the vector orientation of the surface normal in the 3D world. In other words, what is the mapping of coordinate axes to RGB channels?

I tried a quick experiment with a simple indoor scene with walls and a ceiling, where the orientation of the normals can be estimated by eye.

empty_room1

This is the surface normal image I get from Metric3D.

empty_room1-normal-Metric3D

Now if I inspect the pixel values,

(1) The floor pixels have values (126, 0, 126), which corresponds to a (0.5, 0, 0.5) normal vector when normalized to [0, 1], and to (0, -1, 0) when I take the values directly from the model output in the range [-1, 1]. Because I know the floor surface normal should point up, this means it points along the negative y-axis.

Similarly, for other surfaces

(2) The wall directly in front of the camera has values (0, 128, 127), which corresponds to a (0, 0.5, 0.5) normal vector when normalized to [0, 1], and to (-1, 0, 0) when I take the values directly from the model output in the range [-1, 1]. Because I know this wall's surface normal should point directly at the camera, this means it points along the negative x-axis.

(3) The ceiling has values (128, 254, 128), which corresponds to a (0.5, 1, 0.5) normal vector when normalized to [0, 1], and to (0, 1, 0) when I take the values directly from the model output. Because I know the ceiling surface normal should point directly downwards, this means it points along the positive y-axis.

(4) The right wall has values (123, 128, 0), which corresponds to a (0.5, 0.5, 0) normal vector when normalized to [0, 1], and to (0, 0, -1) when I take the values directly from the model output. Because I know the right wall's surface normal should point left, this means it points along the negative z-axis.
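The per-pixel decoding used in observations (1)-(4) can be written as a small helper. This is a minimal sketch assuming the common encoding `n_img = (n + 1) / 2 * 255`; the exact rounding may differ slightly between implementations, and `decode_normal_pixel` is a name I made up for illustration.

```python
import numpy as np

def decode_normal_pixel(rgb):
    """Map an 8-bit RGB normal-map pixel back to a [-1, 1] vector.

    Assumes the encoding n_img = (n + 1) / 2 * 255, so the inverse
    is n = n_img / 127.5 - 1 (up to rounding error of ~0.01).
    """
    return np.asarray(rgb, dtype=np.float64) / 127.5 - 1.0

# Floor pixel from observation (1): decodes to approximately (0, -1, 0).
floor = decode_normal_pixel((126, 0, 126))
assert np.allclose(floor, (0.0, -1.0, 0.0), atol=0.02)

# Ceiling pixel from observation (3): approximately (0, 1, 0).
ceiling = decode_normal_pixel((128, 254, 128))
assert np.allclose(ceiling, (0.0, 1.0, 0.0), atol=0.02)
```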

Based on this interpretation, the coordinate system should look like this. PXL_20241024_003319727~2 This is a left-handed coordinate system.

Whereas, if we swap the first and third channels (i.e. perform an RGB-to-BGR conversion), this is the normal output. empty_room1-normal-Metric3D-rgb2bgr

Now, if we interpret the data the same way, we arrive at this coordinate system (x and z are swapped). PXL_20241024_003542664~2 This is a right-handed coordinate system.

I am trying to use the surface normal data to calculate vector angles for my downstream task. Let me know if my understanding of the coordinate system is correct. This is confusing because other surface normal estimation techniques seem to use different coordinate-system-to-RGB correspondence conventions.

See https://vision-explorer.allenai.org/surface_normals where the orientation is provided when you calculate the normal. Screenshot 2024-10-23 194137

EDIT: I figured out the coordinate system of the normals output by the model. It is the second one in the discussion above (z points directly into the scene, x points to the right, and y points downwards). The cv2 imwrite channel-order convention was confusing my reading of the image. @HardikJain02
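The pitfall described in the EDIT can be demonstrated without any files: `cv2.imwrite` and `cv2.imread` treat the channel axis as BGR, so saving an RGB-ordered normal visualization swaps the x and z channels, which is exactly the apparent handedness flip discussed above. Below, the BGR swap is simulated with a NumPy channel reversal; the encoding convention is the same assumed one as before.

```python
import numpy as np

# A "right wall" normal (0, 0, -1) in the model's camera frame
# (x right, y down, z into the scene), encoded to an 8-bit RGB pixel.
normal = np.array([0.0, 0.0, -1.0])
rgb = np.clip((normal + 1.0) / 2.0 * 255.0, 0, 255).astype(np.uint8)

# cv2.imwrite interprets the channel axis as BGR, so saving the RGB
# array as-is effectively stores the channels in reversed order.
bgr_on_disk = rgb[::-1]

decoded_naive = bgr_on_disk.astype(np.float64) / 127.5 - 1.0
decoded_fixed = bgr_on_disk[::-1].astype(np.float64) / 127.5 - 1.0

# The naive read looks like (-1, 0, 0): the spurious handedness flip.
assert np.allclose(decoded_naive, [-1.0, 0.0, 0.0], atol=0.01)
# Undoing the BGR swap (e.g. cv2.cvtColor with COLOR_BGR2RGB on a real
# image) recovers the true normal.
assert np.allclose(decoded_fixed, normal, atol=0.01)
```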