dcharatan / pixelsplat

[CVPR 2024 Oral, Best Paper Runner-Up] Code for "pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction" by David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann
http://davidcharatan.com/pixelsplat/
MIT License
837 stars, 57 forks

Question about the point cloud figure generation code (generate_point_cloud_figure.py) #25

Closed Youngju-Na closed 7 months ago

Youngju-Na commented 7 months ago

Hello,

First off, I want to express my gratitude for the remarkable work.

I've trained pixelSplat on the DTU dataset and attempted to visualize the point cloud results (generate_point_cloud_figure.py). However, I encountered an issue where the point clouds appear to be split into two distinct sets, one per camera, as if expressed in each camera's own coordinates (right image).

This is in contrast to the results I obtained with the re10k dataset (left image), where the point clouds did not exhibit this separation. For reference, here are the visualizations:

[Image 1: re10k point cloud (left)] [Image 2: DTU point cloud (right)]

The images above illustrate the difference in point cloud visualizations between the two datasets. Could you provide any insight into what might be causing this discrepancy?

Thank you in advance for your assistance.

dcharatan commented 7 months ago

I would first try to rule out that the dataset's camera poses have been loaded incorrectly (see #14 for an example of how to do so). Next, it's possible that the Gaussians you're seeing are a "second" layer of background Gaussians and that the "main" (3D-consistent) Gaussians are actually located much closer to the camera. You can adjust the cropping settings in the visualization script to check whether that's the case. Do the training results otherwise look reasonable?
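One generic way to do such a check (just a rough sketch, not the exact procedure from #14): pick a few hand-labeled correspondences between the two context views and verify that they satisfy the epipolar constraint. The snippet below assumes OpenCV-convention camera-to-world extrinsics and pixel-space intrinsics K1/K2; the function names are made up for illustration.

import numpy as np


def fundamental_from_cameras(K1, c2w1, K2, c2w2):
    """F maps pixels in view 1 to epipolar lines in view 2."""
    # Relative transform taking camera-1 coordinates to camera-2 coordinates.
    rel = np.linalg.inv(c2w2) @ c2w1
    R, t = rel[:3, :3], rel[:3, 3]
    t_x = np.array([
        [0.0, -t[2], t[1]],
        [t[2], 0.0, -t[0]],
        [-t[1], t[0], 0.0],
    ])
    return np.linalg.inv(K2).T @ (t_x @ R) @ np.linalg.inv(K1)


def epipolar_error_px(F, x1_px, x2_px):
    """Point-to-epipolar-line distance in pixels; should be near zero for a true correspondence."""
    x1 = np.array([x1_px[0], x1_px[1], 1.0])
    x2 = np.array([x2_px[0], x2_px[1], 1.0])
    line = F @ x1
    return abs(x2 @ line) / np.linalg.norm(line[:2])

If a handful of obvious correspondences give errors of many pixels, the poses or intrinsics are almost certainly being loaded in the wrong convention.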

FantasticOven2 commented 7 months ago

Hey Author, thank you for this amazing work! I ran into the same problem when training your model on CO3D. I predict Gaussians from the first and third frames of three consecutive frames and leave the second frame as the target to be rendered. The renders for the first and third frames are clear, but the render for the second frame is very blurry; when I checked the point cloud, there seem to be multiple surfaces for each input camera. Is this similar to the checkerboard problem you mentioned in #15? Here are the renders, epipolar lines, and point cloud visualization. Thank you in advance.

[Images: validation renders for frames 0-2, epipolar lines (batch_04), and a point cloud screenshot (2024-02-09)]

dcharatan commented 7 months ago

The first thing to quadruple-check is that the camera metadata has been loaded correctly. It can be easy to mess up CO3D data loading if you're just reading the intrinsics directly, since I'm pretty sure the code provided in this issue is actually sometimes wrong depending on the image orientation (landscape vs. portrait). CO3D also sometimes has low-quality camera poses, which can make it difficult to tell if the method is struggling or the camera poses are just bad. I think the hydrant category is generally known to have good poses, so I would recommend trying only hydrants first. There are also lists of bad sequences that you can filter out to make sure the model is getting high-quality poses.
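As a rough illustration of the filtering idea (assumptions: CO3D v2's per-category sequence_annotations.jgz is a gzipped JSON list with a sequence_name field, and the bad-sequence set below is a placeholder for whichever published bad-pose list you use):

import gzip
import json
from pathlib import Path

CO3D_ROOT = Path("datasets/co3d")  # hypothetical dataset root
BAD_SEQUENCES = set()              # fill in from a bad-pose list of your choice


def usable_hydrant_sequences():
    with gzip.open(CO3D_ROOT / "hydrant" / "sequence_annotations.jgz", "rt") as f:
        annotations = json.load(f)
    return [
        entry["sequence_name"]
        for entry in annotations
        if entry["sequence_name"] not in BAD_SEQUENCES
    ]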

Assuming the camera metadata is loaded correctly, one thing that might be worth trying is setting model.encoder.near_disparity to a smaller value (you can find the default value of 3.0 in config/model/encoder/epipolar.yaml). A smaller value of this parameter will set the near plane to be further away. This is because the value corresponds to the approximate distance (in terms of screen widths) that a point on the near plane will move on the image plane if the camera is moved the distance between the context views. The default value works well for the datasets we trained on, but it's possible that CO3D might work better with small values (i.e., near planes that are further away) since it's more object-centric than Real Estate 10k and ACID. In particular, a further away near plane will likely concentrate more of the probability/bucket mass on the overlapping regions between the context views.
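To make that relationship concrete, here is a minimal sketch (not repository code) that converts near_disparity into a metric near-plane depth, assuming a pinhole camera with focal length in pixels and a known baseline between the two context cameras:

def near_plane_from_disparity(
    focal_px: float,        # focal length in pixels
    image_width_px: float,  # image width in pixels
    baseline: float,        # distance between the two context cameras (world units)
    near_disparity: float = 3.0,  # screen widths of motion at the near plane
) -> float:
    # Pinhole relation: disparity_px = focal_px * baseline / depth, with
    # near_disparity given in screen widths rather than pixels.
    return focal_px * baseline / (near_disparity * image_width_px)

Note that a smaller near_disparity yields a larger depth, i.e., a near plane that is further away.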

If you want a more general approach to setting the near plane, it might be worth using the distance to the point where the context views' frustums intersect (or some fraction of that distance, say 50%). If you don't care about having a general approach and just want CO3D to work, you can probably set the near plane to the value the dataset provides, although you probably still want to keep a further-away far plane so the background is included.
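A crude stand-in for the frustum-intersection idea (again just a sketch, not repository code) is to use the closest approach of the two optical axes; this assumes OpenCV-convention camera-to-world matrices, so the third column of the rotation is the viewing direction:

import numpy as np


def near_from_axis_crossing(c2w_a, c2w_b, fraction=0.5):
    ca, cb = c2w_a[:3, 3], c2w_b[:3, 3]  # camera centers
    da, db = c2w_a[:3, 2], c2w_b[:3, 2]  # viewing directions (+z in OpenCV convention)
    da, db = da / np.linalg.norm(da), db / np.linalg.norm(db)

    # Closest points between the two rays: minimize |ca + s*da - (cb + t*db)|.
    w = ca - cb
    b = da @ db
    d, e = da @ w, db @ w
    denom = 1.0 - b * b  # da and db are unit vectors
    if denom < 1e-8:  # (near-)parallel axes: fall back to the baseline length
        return fraction * np.linalg.norm(w)
    s = (b * e - d) / denom  # depth of the crossing point along camera a's axis
    t = (e - b * d) / denom  # depth along camera b's axis
    # Negative values mean the cameras look away from each other.
    return fraction * min(s, t)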

Note that setting the far plane is easy: it's simply the depth at which the disparity becomes negligible (<0.5 pixels).
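The same pinhole relation gives the far plane directly (sketch only, with the same hypothetical inputs as above):

def far_plane_from_disparity(focal_px: float, baseline: float, min_disparity_px: float = 0.5) -> float:
    # Depth at which the disparity between the context views drops below half a pixel.
    return focal_px * baseline / min_disparity_px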

FantasticOven2 commented 7 months ago

Solved! It turned out to be a camera intrinsics problem. Really amazing work, thanks!

kevinYitshak commented 6 months ago

Hi @FantasticOven2 and @dcharatan,

Could you let me know how to load the intrinsic matrix for CO3D? I also visualized the epipolar lines, and they seem to match. [Images: epipolar line visualizations (example_04, example_06, example_07)]

I am currently loading the camera parameters using this function:

# Reference: https://github.com/facebookresearch/pytorch3d/blob/main/pytorch3d/implicitron/dataset/frame_data.py#L708
from typing import Tuple

import torch
from pytorch3d.renderer import PerspectiveCameras

def _get_pytorch3d_camera(
    entry,
) -> PerspectiveCameras:
    entry_viewpoint = entry.viewpoint
    assert entry_viewpoint is not None
    # principal point and focal length
    principal_point = torch.tensor(entry_viewpoint.principal_point, dtype=torch.float)
    focal_length = torch.tensor(entry_viewpoint.focal_length, dtype=torch.float)

    format = entry_viewpoint.intrinsics_format
    if entry_viewpoint.intrinsics_format == "ndc_norm_image_bounds":
        # legacy PyTorch3D NDC format
        # convert to pixels unequally and convert to ndc equally
        image_size_as_list = list(reversed(entry.image.size))
        image_size_wh = torch.tensor(image_size_as_list, dtype=torch.float)
        per_axis_scale = image_size_wh / image_size_wh.min()
        focal_length = focal_length * per_axis_scale
        principal_point = principal_point * per_axis_scale
    elif entry_viewpoint.intrinsics_format != "ndc_isotropic":
        raise ValueError(f"Unknown intrinsics format: {format}")

    return PerspectiveCameras(
        focal_length=focal_length[None],
        principal_point=principal_point[None],
        R=torch.tensor(entry_viewpoint.R, dtype=torch.float)[None],
        T=torch.tensor(entry_viewpoint.T, dtype=torch.float)[None],
    )

and converting them to OpenCV format using this:

def _opencv_from_cameras_projection(
    cameras: PerspectiveCameras,
    image_size: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    R_pytorch3d = cameras.R.clone()  # pyre-ignore
    T_pytorch3d = cameras.T.clone()  # pyre-ignore
    focal_pytorch3d = cameras.focal_length
    p0_pytorch3d = cameras.principal_point
    T_pytorch3d[:, :2] *= -1
    R_pytorch3d[:, :, :2] *= -1
    tvec = T_pytorch3d
    R = R_pytorch3d.permute(0, 2, 1)

    # Retype the image_size correctly and flip to width, height.
    image_size_wh = image_size.to(R).flip(dims=(1,))

    # NDC to screen conversion.
    scale = image_size_wh.to(R).min(dim=1, keepdim=True)[0] / 2.0
    scale = scale.expand(-1, 2)
    c0 = image_size_wh / 2.0

    principal_point = -p0_pytorch3d * scale + c0
    # https://github.com/facebookresearch/co3d/issues/4#issuecomment-1952224331
    focal_length = focal_pytorch3d * scale

    camera_matrix = torch.zeros_like(R)
    camera_matrix[:, :2, 2] = principal_point
    camera_matrix[:, 2, 2] = 1.0
    camera_matrix[:, 0, 0] = focal_length[:, 0]
    camera_matrix[:, 1, 1] = focal_length[:, 1]
    return R, tvec, camera_matrix

And this is how I call these functions:

def _process_intrinsic(x):
    pycamera = _get_pytorch3d_camera(x)
    h, w = x.image.size  # CO3D stores the image size as (height, width)
    _, _, K = _opencv_from_cameras_projection(pycamera, torch.tensor(((h, w),)))
    K = K.squeeze(0)
    # Normalize K by (w, h), as expected by the repo.
    K[0, :] /= w
    K[1, :] /= h
    return K
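(Not part of my loader, but a quick informal check I run on the normalized K, under the assumption from the comment above that the repo expects intrinsics normalized by image width/height:)

def sanity_check_normalized_k(K) -> None:
    # With intrinsics normalized by width/height, a roughly centered principal
    # point should land near (0.5, 0.5); this only warns, it does not fail.
    cx, cy = float(K[0, 2]), float(K[1, 2])
    assert 0.0 < cx < 1.0 and 0.0 < cy < 1.0, "principal point falls outside the image"
    if abs(cx - 0.5) > 0.2 or abs(cy - 0.5) > 0.2:
        print(f"warning: principal point ({cx:.3f}, {cy:.3f}) is far from the image center")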

Further, these are the renderings and projections for overfitted examples trained for 7k iterations. The results still seem blurry and have some floaters. Is this expected, or am I making a mistake somewhere? I am not sure.

[Images: hydrant overfitting renders and projections]

Any help would be appreciated. Thanks!

FantasticOven2 commented 6 months ago

Hi @kevinYitshak, I'm not 100% sure, but I think what you have is correct; did you try training on all of the hydrant sequences?

kevinYitshak commented 6 months ago

Hi @FantasticOven2, I did try, and here is an example result! [Image: pixelsplat renders]

FantasticOven2 commented 6 months ago

Thanks @kevinYitshak! Can you also show the projection figures? In my case I got correct RGB and depth renders, but the point clouds have a checkerboard pattern.

kevinYitshak commented 6 months ago

Hi @FantasticOven2, these are the projections: [Image: pixelsplat_proj]

Also, I am setting the near and far planes as mentioned here: https://github.com/facebookresearch/co3d/issues/18#issuecomment-954768105

Also, what was the issue in your case?

kevinYitshak commented 6 months ago

Hi @FantasticOven2, Also how did you set your near and far planes?

FantasticOven2 commented 6 months ago

Hey @kevinYitshak, sorry for the late reply. I used near=0.01 and far=100; I will try the near/far planes you used.