NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

Question regarding the reference frame for depth from interpolation #149

Closed: raoshashank closed this issue 6 months ago

raoshashank commented 7 months ago

Hi, I am trying to render a mesh to RGB and then use the interpolate function, with the homogeneous mesh coordinates as the attributes, to render the corresponding depth image. I am confused about what coordinate frame and scale the output coordinates are in. For example, in native OpenGL the z-buffer contains the camera-space z coordinates stored in a non-linear encoding. When I transform the interpolated coordinates with the object->world and then the world->camera transformation matrices, I get values between 0 and 1, which doesn't make much sense to me; I expect them to lie between znear and zfar (from the camera's projection parameters).

    # Homogeneous object-space vertex positions (x, y, z, 1).
    posw = torch.cat([pos, torch.ones([pos.shape[0], pos.shape[1], 1]).cuda()], dim=2)

    # Object -> camera and object -> clip transforms.
    T_obj_to_camera = torch.matmul(view_mtx, obj_pose)
    final_mtx_proj = torch.matmul(projection_mtx, T_obj_to_camera)

    # Clip-space positions for rasterization.
    pos_clip = transform_points(pos.contiguous(), final_mtx_proj)
    rast_out, rast_out_db = dr.rasterize(
        glctx, pos_clip, pos_idx[0], resolution=resolution
    )

    # Interpolate the homogeneous object-space positions over the rasterized pixels.
    gb_pos, _ = interpolate(posw, rast_out, pos_idx[0], rast_db=rast_out_db)
    shape_keep = gb_pos.shape
    gb_pos = gb_pos.reshape(shape_keep[0], -1, shape_keep[-1])
    gb_pos = gb_pos[..., :3]

    # Transform per-pixel positions to camera space and take the negated z coordinate as depth.
    depth = transform_points(gb_pos.contiguous(), T_obj_to_camera)
    depth = depth.reshape(shape_keep)[..., 2] * -1

(Reference: https://github.com/NVlabs/diff-dope) Thanks!

s-laine commented 7 months ago

The range of the interpolate() output depends on the data it interpolates, which is posw here. I would expect that transform_points(pos, T_obj_to_camera)[..., 2] * -1, i.e., just transforming the vertices to camera space, would have depth values in the same range? (Assuming that transform_points() is just a matrix multiplication)

In other words, I think the code should work as-is, so maybe I'm not understanding the problem correctly. What kind of znear/zfar values are you using?
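
As a quick sanity check, something like this (a sketch using the variables from your snippet; the fourth rasterizer output channel is triangle ID + 1, so nonzero means a foreground pixel) should show the two ranges agreeing:

    # Per-vertex camera-space depth (negated z, positive in front of the camera).
    vert_depth = transform_points(pos.contiguous(), T_obj_to_camera)[..., 2] * -1

    # The interpolated depth image should stay within the per-vertex depth range.
    fg = rast_out[..., 3] > 0
    print("vertex depth range:", vert_depth.min().item(), vert_depth.max().item())
    print("pixel depth range: ", depth[fg].min().item(), depth[fg].max().item())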

As a side note, it would be slightly more efficient to interpolate camera-space depth instead of the full vertex positions, i.e., do the object->camera transform and take the z coordinate, and interpolate only those. Anything that acts linearly in object/world/camera space can be interpolated.
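
For instance, something along these lines (a sketch reusing the names from your snippet, assuming transform_points() returns camera-space xyz for a 4x4 transform):

    # Transform object-space vertices to camera space once and keep only the negated z.
    cam_pos = transform_points(pos.contiguous(), T_obj_to_camera)    # (batch, num_verts, 3)
    cam_depth = (cam_pos[..., 2:3] * -1).contiguous()                # (batch, num_verts, 1)

    # Interpolate the single-channel per-vertex depth; rast_db / diff_attrs are only
    # needed if you also want screen-space derivatives of the attribute.
    depth_img, _ = dr.interpolate(cam_depth, rast_out, pos_idx[0])   # (batch, H, W, 1)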

raoshashank commented 7 months ago

I'm using znear = 0.01 and zfar = 100.0. I think I understand the interpolate function now; after plotting the interpolated points, the output looks correct.

However, I would like another clarification. Suppose my task is pose estimation, i.e., optimizing the pose of a mesh given reference RGB-D + segmentation images by matching the rendered RGB-D + segmentation images via gradient descent. Am I correct in understanding that although the rasterization process isn't differentiable w.r.t. the z-coordinate, interpolating the xyz mesh coordinates (which themselves depend on the pose of the mesh) makes the resulting depth image differentiable w.r.t. the pose of the mesh?

s-laine commented 7 months ago

Yes, this is correct, although there is still the question of occlusion/visibility gradients at silhouette edges, i.e., where moving an edge reveals or occludes another surface behind it. These become differentiable only via the antialias op; in this use case you'd antialias the depth image. However, whether this is necessary depends on the content, and it is entirely possible that these gradients are not needed for finding the solution.
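
A minimal sketch of that last step, reusing the tensors from the snippets above and assuming the interpolated depth image depth_img has shape (minibatch, height, width, 1):

    # Antialiasing the depth image makes occlusion/visibility changes at silhouette
    # edges differentiable w.r.t. the clip-space vertex positions, and hence the pose.
    depth_aa = dr.antialias(depth_img.contiguous(), rast_out, pos_clip, pos_idx[0])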