autonomousvision / differentiable_volumetric_rendering

This repository contains the code for the CVPR 2020 paper "Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision"
http://www.cvlibs.net/publications/Niemeyer2020CVPR.pdf

Data format of the processed DTU scenes #3

Closed Kai-46 closed 4 years ago

Kai-46 commented 4 years ago

Thanks for sharing your source code! I'm trying to understand the coordinate system used in the provided DTU scenes, but I'm a bit lost. To transform a 3D point (x, y, z) to a pixel (u, v), three matrices are used: K, Rt, and scale. I checked the values of K, Rt, and scale; they look quite different from the usual OpenCV definitions of K and Rt, e.g., the R part of the matrix in the code is not orthonormal. I would appreciate any hints about the coordinate system being used.

m-niemeyer commented 4 years ago

Hi @Kai-46, thanks a lot for your interest! Here are some answers to your question:

World Matrix (Rt): We use "Rt" directly from the DTU dataset - if you download it, the matrices are provided in "Calibration/cal18/pos_***.txt". If you apply Rt to a point (x, y, z) and get (x', y', z'), the pixel location is (u, v) = (x', y') / z'. These values are in the ranges [0, W] and [0, H], where W and H are the image resolution.
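As a small sketch of what this means in code (an assumption for illustration: I read the DTU calibration files as storing a 3x4 projection matrix in plain whitespace-separated text; the path and the example point are placeholders):

import numpy as np

P = np.loadtxt('Calibration/cal18/pos_001.txt')  # (3, 4) DTU projection matrix "Rt"
point = np.array([1.0, 2.0, 3.0, 1.0])           # example homogeneous 3D point in DTU world coordinates
xp, yp, zp = P @ point
u, v = xp / zp, yp / zp                          # pixel location in [0, W] x [0, H]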

Camera Matrix (K): To be independent of the image resolution, we have the convention that we scale the pixel locations to [-1, 1] for all datasets. This is useful, e.g. for getting the ground truth pixel values in PyTorch with grid_sample. In the DTU case, we then only have to shift and rescale such that the ranges (see above) change to [-1, 1].
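For illustration, one plausible form of such a normalizing camera matrix is sketched below (an assumption for illustration only; the exact values stored in cameras.npz may differ, e.g. in sign or pixel-center conventions). Note that the shift by -1 sits in the z column so that it survives the later division by z:

import numpy as np

W, H = 1600, 1200  # example image resolution (assumption)

# maps (x', y', z', 1) with pixel coordinates (x'/z', y'/z') in [0, W] x [0, H]
# to normalized pixel coordinates in [-1, 1] after the division by z'
K_norm = np.array([
    [2.0 / W, 0.0,     -1.0, 0.0],
    [0.0,     2.0 / H, -1.0, 0.0],
    [0.0,     0.0,      1.0, 0.0],
    [0.0,     0.0,      0.0, 1.0],
])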

Scale Matrix (S): Finally, we use a scale matrix in our project. The DTU dataset does not use a canonical world coordinate system, and hence the objects can be at very different locations. However, we want to center the object / volume of interest in the unit cube. We do this via the scale matrix S: the inverse S^-1 maps our volume of interest from the DTU world coordinates to the unit cube. We did not merge this matrix into Rt so that we can still transform points back to the DTU world.
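As an illustration of how such a scale matrix could be constructed for a new scene, here is a minimal sketch that builds S from an axis-aligned bounding box of the volume of interest; the function name, the cube convention ([-0.5, 0.5]^3) and the uniform-scale choice are assumptions, not the exact recipe used to produce cameras.npz:

import numpy as np

def scale_matrix_from_bbox(bbox_min, bbox_max, cube_half_extent=0.5):
    # bbox_min, bbox_max: (3,) world-space corners of the volume of interest
    bbox_min = np.asarray(bbox_min, dtype=np.float64)
    bbox_max = np.asarray(bbox_max, dtype=np.float64)
    center = 0.5 * (bbox_min + bbox_max)                         # world-space center
    s = (bbox_max - bbox_min).max() / (2.0 * cube_half_extent)   # uniform scale factor
    S = np.eye(4)
    S[:3, :3] *= s      # unit cube -> world scaling
    S[:3, 3] = center   # unit cube -> world translation
    return S            # S maps the unit cube to world; S^-1 maps world to the unit cube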

How to transform pixels to 3D points and vice versa: We always use homogeneous coordinates, so you can transform a homogeneous 3D point p from "our" world (the unit cube) to pixel coordinates (in [-1, 1]) by first calculating p_out = K @ Rt @ S @ p and then (u, v) = p_out[:2] / p_out[2]. This is exactly how we do it in the code (@ means matrix multiplication). For the other direction, you can transform a homogeneous pixel (u, v, 1, 1) to the world by first multiplying it with the depth value, pixel[:3] *= depth_value, and then going back the other way: p = S^-1 @ (Rt)^-1 @ K^-1 @ pixel. Here is the respective code.
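To make both directions concrete, here is a minimal numpy sketch that mirrors the description above (K, Rt, S are the 4x4 camera_mat, world_mat and scale_mat from cameras.npz; the function and variable names are mine, not the repository's API):

import numpy as np

def project_to_image(p, K, Rt, S):
    # p: (N, 4) homogeneous 3D points in the unit cube
    p_out = (K @ Rt @ S @ p.T).T                  # (N, 4)
    return p_out[:, :2] / p_out[:, 2:3]           # pixel coordinates in [-1, 1]

def unproject_to_world(uv, depth, K, Rt, S):
    # uv: (N, 2) pixels in [-1, 1]; depth: (N,) depth values
    pixel = np.concatenate([uv, np.ones((uv.shape[0], 2))], axis=1)  # (u, v, 1, 1)
    pixel[:, :3] *= depth[:, None]                # scale (u, v, 1) by the depth value
    p = (np.linalg.inv(S) @ np.linalg.inv(Rt) @ np.linalg.inv(K) @ pixel.T).T
    return p[:, :3] / p[:, 3:4]                   # non-homogeneous 3D points in the unit cube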

I hope this helps a little. Good luck with your research!

Kai-46 commented 4 years ago

Thanks for the clarification! It's really helpful. The definition of K, R, t is quite different from OpenCV's. In OpenCV, R is a 3x3 orthonormal matrix, which is aligned with the notion of a rotation matrix. A 3D point x (a 3x1 vector) in the world coordinate system is first transformed to the camera coordinate system via Rx + t, then projected to image space via (u, v, 1)' = K(Rx + t), where K is a 3x3 intrinsic matrix containing the focal lengths and principal point. In this project, Rt looks like the product of OpenCV's K and [R|t] (augmented to 4x4), i.e. the projection matrix, while K here normalizes pixel coordinates to [-1, 1]. This is the first time I have seen this notation. Thanks again for the explanation.

m-niemeyer commented 4 years ago

@Kai-46, yes, you are right - for the DTU dataset, it is basically that product because we want to stick to the DTU data. For e.g. the ShapeNet renderings, the matrices should be what you have in mind, except that we define the image pixels in [-1, 1] instead of [0, W] and [0, H]. I hope this helps. Good luck!

Kai-46 commented 4 years ago

Good luck to you as well! One minor suggestion: if you could add some text describing your coordinate system convention to the README, it might help others as well. My personal experience with 3D reconstruction is that coordinate systems can be quite a headache without knowing a priori which convention is adopted, as there seem to be many different ones :-) Personally, I work with the OpenCV or OpenGL conventions most of the time.

m-niemeyer commented 4 years ago

Thanks for the suggestion! I had something like this in mind - I will do it when I find time. If you have no further questions, feel free to close the issue - thanks!

cortwave commented 4 years ago

Hi @m-niemeyer, I also have some difficulties understanding the camera format. As I understand from this thread, after projecting DTU object points into a camera we should get values in [-1, 1]. I wrote the following code for testing purposes:

import numpy as np

scan = 118
cameras = np.load(f'differentiable_volumetric_rendering/data/DTU/scan{scan}/scan{scan}/cameras.npz')
points = np.load(f'differentiable_volumetric_rendering/data/DTU/scan{scan}/scan{scan}/pcl.npz')['points']
# to homogeneous coordinates
points = np.hstack([points, np.ones((points.shape[0], 1))])

# project the points into camera 10
idx = 10
world_mat = cameras[f'world_mat_{idx}']
camera_mat = cameras[f'camera_mat_{idx}']
scale_mat = cameras[f'scale_mat_{idx}']

projected = (camera_mat @ world_mat @ scale_mat @ points.T).T
# from homogeneous to 2d
projected = projected[:, :2] / projected[:, 2:3]

As a result, I get projected values outside the [-1, 1] range. Can you please clarify what I'm doing wrong in the code above?

m-niemeyer commented 4 years ago

Hi @cortwave , thanks for your post.

In theory, what you are doing is correct! However, the points you load here

points = np.load(f'differentiable_volumetric_rendering/data/DTU/scan{scan}/scan{scan}/pcl.npz')['points']

are the sparse keypoints that are a by-product of Structure-from-Motion (SfM). We use them in our project for investigating sparser types of depth supervision instead of a full depth map. In Section 3.5.3 of our supplementary, we write: "Another type of supervision which one encounters often in practice is the even sparser output of Structure-from-Motion (SfM). In particular, this is a small set of 3D keypoints with visibility masks for each view mainly used for camera pose estimation."

To train such a model, you have to use one of the ours_depth_sfm.yaml configs from the multi-view supervision experiments.

Now, coming back to your question: before projecting the keypoints into a view, you have to keep only the points that are visible in the respective image; otherwise you project all keypoints into the view, but many of them lie outside the image or may be occluded. For more details, please have a look at our data field to see how we process the points.

cortwave commented 4 years ago

@m-niemeyer thank you for your response. I've changed my code according to your suggestions.

import numpy as np

scan = 118
cameras = np.load(f'/home/cortwave/projects/differentiable_volumetric_rendering/data/DTU/scan{scan}/scan{scan}/cameras.npz')
npz_file = np.load(f'/home/cortwave/projects/differentiable_volumetric_rendering/data/DTU/scan{scan}/scan{scan}/pcl.npz')

idx = 10
p = npz_file['points']                    # sparse SfM keypoints in DTU world coordinates
is_in_visual_hull = npz_file['is_in_visual_hull']
c = npz_file['colors']
v = npz_file[f'visibility_{idx:>04}']     # visibility information for image idx

# keep only keypoints that are visible in this view and lie in the visual hull
p = p[v][is_in_visual_hull[v]]
c = c[v][is_in_visual_hull[v]]

p = np.hstack([p, np.ones((p.shape[0], 1))])

world_mat = cameras[f'world_mat_{idx}']
camera_mat = cameras[f'camera_mat_{idx}']
scale_mat = cameras[f'scale_mat_{idx}']

projected = (camera_mat @ world_mat @ scale_mat @ p.T).T
projected = projected[:, :2] / projected[:, 2:3]
print(np.min(projected), np.max(projected))

But it still projects points outside the [-1, 1] range; e.g., for scan 118 and image 10 it prints 0.387742314717342 2.184980055074516. I also have trouble understanding why here you project the points without the scale matrix, while here you transform the 3D points to the unit cube using the inverted scale matrix.

m-niemeyer commented 4 years ago

Why do you project here without scale: Because we directly transform npz_file['points'], which, as indicated before, are the sparse keypoints from SfM that "live in the world space".

Why do you transform 3D points using the inverted scale matrix here: As indicated above, "p lives in world space", and scale_mat defines the transformation from "unit cube to world", so if we want to go from world to unit cube, we have to apply the inverse transformation. Here it is as a diagram:

(Unit Cube) - scale_mat -> (World) - world_mat -> (Camera) - camera_mat -> (Image)

Is my transformation correct: No - for projecting to the image, you would have to do what we do here from L126 to L129; hence you have to remove the scale matrix from your line

projected = (camera_mat @ world_mat @ scale_mat @ p.T).T
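In other words, a corrected version of that projection (a sketch reusing the variable names from the snippet above) would be:

# the SfM keypoints in pcl.npz already live in DTU world coordinates,
# so the scale matrix must not be applied before projecting them
projected = (camera_mat @ world_mat @ p.T).T
projected = projected[:, :2] / projected[:, 2:3]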

cortwave commented 4 years ago

Oh, thank you! I think I now understand these coordinate transformations. I didn't know that the points in the npz file already live in DTU world coordinates. Thank you for the clarification!

tiexuedanxin commented 3 years ago

Hello, I have created my own dataset. Could you give me some advice on how to calculate the scale matrix? Thanks very much.