donydchen / mvsplat

🌊 [ECCV'24] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
https://donydchen.github.io/mvsplat

Question about calculating the position of gaussian #41

Closed kevinchiu19 closed 4 days ago

kevinchiu19 commented 1 week ago

Thanks for sharing your great effort and work.

I have a small question about this snippet in src/model/encoder/common/gaussian_adapter.py:

    # Compute Gaussian means.
    origins, directions = get_world_rays(coordinates, extrinsics, intrinsics)
    means = origins + directions * depths[..., None]

The directions used in this calculation have been normalized to unit length, so the resulting means will not lie at the correct world coordinates.

For example, in an autonomous driving scene the point cloud xyz is used directly as the Gaussian positions. The figure below compares the results obtained with normalized and non-normalized directions. [figure: comparison of Gaussian positions with normalized vs. non-normalized ray directions]
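
To make the mismatch concrete, here is a minimal numeric sketch (not the repository's code; identity extrinsics assumed, so camera space coincides with world space) showing that with a unit-norm direction, the predicted depth acts as distance along the ray rather than as the camera-space z coordinate:

    import numpy as np

    # Hypothetical standalone example, not code from mvsplat.
    origin = np.zeros(3)
    direction = np.array([0.6, 0.0, 0.8])   # unit length, as when ray directions are normalized
    depth = 10.0                             # depth value predicted by the network

    point = origin + direction * depth
    print(point)       # [6. 0. 8.]
    print(point[2])    # 8.0 -> camera-space z is 8, not 10

    # With unit-norm directions, `depth` is the Euclidean distance along the ray,
    # so the point's z coordinate generally differs from `depth`. A LiDAR point
    # cloud stores metric xyz directly, hence the mismatch shown in the figure.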

donydchen commented 5 days ago

Hi @kevinchiu19, I think the pixelSplat team has provided a nice answer to this question at https://github.com/dcharatan/pixelsplat/issues/82. I quote their answer below for reference, in case anyone else shares the same concern.

Depth can either be defined as distance along the ray or as Z depth (distance along the camera look vector/Z coordinate in camera space). Since depth is predicted by a neural network, the convention that's being used doesn't matter—the network will simply learn whatever convention is being used.

If you want to switch to the other convention (Z depth), you can replace https://github.com/donydchen/mvsplat/blob/378ff818c0151719bbc052ac2797a2c769766320/src/geometry/projection.py#L105 with the following:

directions = directions / directions[..., -1:]

This will normalize by the Z coordinate instead of by ray length.
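
For completeness, here is a small sketch (again not the repository's code, same hypothetical pixel as above) of the Z-depth convention, i.e. after dividing the direction by its z component as suggested:

    import numpy as np

    # Hypothetical standalone example: Z-depth convention.
    origin = np.zeros(3)
    direction = np.array([0.6, 0.0, 0.8])
    depth = 10.0

    direction_z = direction / direction[-1]  # z component becomes 1
    point = origin + direction_z * depth
    print(point)       # [ 7.5  0.  10. ]
    print(point[2])    # 10.0 -> camera-space z now equals `depth`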

kevinchiu19 commented 4 days ago

Okay, thank you again for your great work!
