ShenhanQian / GaussianAvatars

[CVPR 2024 Highlight] The official repo for "GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians"
https://shenhanqian.github.io/gaussian-avatars

Why are the pre-processed camera extrinsic parameters different from those in the original NeRSemble dataset? #9

Closed zydmu123 closed 9 months ago

zydmu123 commented 9 months ago

I'd like to know if any special adjustment is applied to the cameras. Thanks a lot!

ShenhanQian commented 9 months ago

Hi, the extrinsics of raw NeRSemble are world2camera in the OpenCV convention, obtained from COLMAP.

Before FLAME tracking, we convert the extrinsics into camera2world in the OpenGL convention, the same as most synthetic NeRF datasets.
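For concreteness, a minimal sketch of that conversion (the function name and the 4x4 input `w2c_cv` are illustrative, not code from the repo):

import torch

def opencv_w2c_to_opengl_c2w(w2c_cv: torch.Tensor) -> torch.Tensor:
    """Convert a 4x4 world2camera matrix (OpenCV) into camera2world (OpenGL)."""
    c2w = torch.inverse(w2c_cv)   # world2camera -> camera2world
    c2w[:3, 1:3] *= -1            # flip the y and z camera axes (OpenCV -> OpenGL)
    return c2w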

We also apply a global rotation to all cameras to align their mean pose with the world coordinate axes.

from typing import Literal

import torch


def align_cameras_to_axes(
    R: torch.Tensor,
    T: torch.Tensor,
    target_convention: Literal["opengl", "opencv"] = None,
):
    """Align the averaged axes of the cameras with the world axes.

    Args:
        R: rotation matrices (N, 3, 3)
        T: translation vectors (N, 3)
    """
    # The column vectors of R are the basis vectors of each camera.
    # We construct new world bases from the mean directions of the camera axes,
    # then use the Gram-Schmidt process to make them orthonormal.
    bases_c2w = gram_schmidt_orthogonalization(R.mean(0))
    if target_convention == "opengl":
        bases_c2w[:, [1, 2]] *= -1  # flip the y and z axes
    elif target_convention == "opencv":
        pass
    bases_w2c = bases_c2w.t()

    # express the camera poses in the new coordinate system
    R = bases_w2c[None, ...] @ R
    T = (bases_w2c[None, ...] @ T[..., None]).squeeze(-1)  # keep T as (N, 3)
    return R, T
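`gram_schmidt_orthogonalization` is not shown above; a minimal sketch of such a helper (my own version, which may differ from the one in the repo) could look like this:

def gram_schmidt_orthogonalization(M: torch.Tensor) -> torch.Tensor:
    """Orthonormalize the column vectors of a 3x3 matrix via classic Gram-Schmidt."""
    u0 = M[:, 0] / M[:, 0].norm()
    u1 = M[:, 1] - (M[:, 1] @ u0) * u0
    u1 = u1 / u1.norm()
    u2 = M[:, 2] - (M[:, 2] @ u0) * u0 - (M[:, 2] @ u1) * u1
    u2 = u2 / u2.norm()
    return torch.stack([u0, u1, u2], dim=1)  # columns are the orthonormal basis vectors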

After we get FLAME tracking results, we add a global translation to all cameras and the FLAME mesh so that the mean position of the head in each sequence is at the origin.
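A rough sketch of that re-centering step, assuming `T` holds the camera2world translations (N, 3), `flame_verts` the tracked FLAME vertices (num_frames, num_verts, 3), and `flame_translation` the FLAME global translation; these names are illustrative, not the actual tracking code:

head_center = flame_verts.mean(dim=(0, 1))            # mean head position over the sequence, (3,)
T = T - head_center                                   # shift all camera positions
flame_translation = flame_translation - head_center   # shift the FLAME mesh accordingly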

zydmu123 commented 9 months ago

Thanks for your kind reply, @ShenhanQian! As you mentioned above, the pre-processed camera extrinsic parameters have already been matched with the FLAME mesh after the tracking process. However, there still seems to be a slight issue when converting your preprocessed camera parameters into PyTorch3D's format. In my test (screenshot: GS_t), I can't get correctly matched results through a PerspectiveCameras object. Did I miss something important?

ShenhanQian commented 9 months ago

For your reference, here is a code snippet that works with PyTorch3D on our side:

import numpy as np
import torch

c2w = torch.tensor(frame['transform_matrix'])
c2w[:3, [0, 2]] *= -1  # OpenGL to PyTorch3D
w2c = torch.inverse(c2w).float()
w2c[:3, :3] = w2c.clone()[:3, :3].T  # PyTorch3D uses x = XR + t, while OpenGL uses x = RX + t
self.data["world_mats"].append(w2c)

# construct intrinsic matrix
intrinsics = np.zeros((4, 4))
intrinsics[0, 0] = frame['fl_x'] / frame['w'] * 2
# intrinsics[1, 1] = frame['fl_y'] / frame['h'] * 2
intrinsics[1, 1] = frame['fl_y'] / frame['w'] * 2  # NOTE: the NDC space is a cube, so we use the same scale for x and y
intrinsics[0, 2] = -(frame['cx'] / frame['w'] * 2 - 1)
intrinsics[1, 2] = -(frame['cy'] / frame['h'] * 2 - 1)
intrinsics[3, 2] = 1.
intrinsics[2, 3] = 1.
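
If it helps, these matrices can then be plugged into a PyTorch3D camera, e.g. (a usage sketch, assuming `w2c` and `intrinsics` from the snippet above):

from pytorch3d.renderer import PerspectiveCameras

cameras = PerspectiveCameras(
    R=w2c[None, :3, :3],                           # (1, 3, 3), already transposed above
    T=w2c[None, :3, 3],                            # (1, 3)
    K=torch.from_numpy(intrinsics).float()[None],  # (1, 4, 4) NDC intrinsics
    in_ndc=True,
)

With in_ndc=True, PyTorch3D interprets K in its NDC convention, which is why the focal lengths and principal point are normalized in the snippet above.
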
zydmu123 commented 9 months ago

It works, thanks a lot!

SSground commented 7 months ago

> Hi, the extrinsics of raw NeRSemble are world2camera in the OpenCV convention, obtained from COLMAP. [...] After we get FLAME tracking results, we add a global translation to all cameras and the FLAME mesh so that the mean position of the head in each sequence is at the origin.

I used the MetaHuman model. Is it necessary to put the model at the origin?