Youngju-Na closed this issue 7 months ago.
I would first try to rule out that the dataset's camera poses have been loaded incorrectly (see #14 for an example of how to do so). Next, it's possible that the Gaussians you're seeing are a "second" layer of background Gaussians and that the "main" (3D-consistent) Gaussians are actually located much closer to the camera. You can adjust the cropping settings in the visualization script to check whether that's the case. Do the training results otherwise look reasonable?
Hey authors, thank you for this amazing work! I ran into the same problem when training your model on CO3D. I output Gaussians using the first and third of three consecutive frames and leave the second frame to be predicted. The renders for the first and third frames are clear, but the render for the second frame is very blurry. I checked the point cloud, and it seems like there are multiple surfaces for each input camera; is this similar to the checkerboard problem you mentioned in #15? Here are the render, epipolar lines, and point cloud visualization. Thank you in advance.
The first thing to quadruple-check is that the camera metadata has been loaded correctly. It can be easy to mess up CO3D data loading if you're just reading the intrinsics directly, since I'm pretty sure the code provided in this issue is actually sometimes wrong depending on the image orientation (landscape vs. portrait). CO3D also sometimes has low-quality camera poses, which can make it difficult to tell whether the method is struggling or the camera poses are just bad. I think the hydrant category is generally known to have good poses, so I would recommend trying only hydrants first. There are also lists of bad sequences that you can filter out to make sure the model is getting high-quality poses.
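To illustrate the orientation pitfall, here is a toy sketch (not code from the repo) assuming the legacy NDC convention scales intrinsics by the shorter image side, as PyTorch3D's `ndc_norm_image_bounds` handling appears to do; `per_axis_scale` is a hypothetical helper:

```python
# Hypothetical illustration of why image orientation matters when converting
# CO3D's "ndc_norm_image_bounds" intrinsics: the per-axis scale that maps the
# legacy NDC convention to the isotropic one depends on which side is longer.
def per_axis_scale(width, height):
    smaller = min(width, height)
    return (width / smaller, height / smaller)

# Landscape: the x axis is stretched.
assert per_axis_scale(800, 600) == (800 / 600, 1.0)
# Portrait: the y axis is stretched instead, so code that hard-codes one case
# silently produces wrong focal lengths for the other orientation.
assert per_axis_scale(600, 800) == (1.0, 800 / 600)
```

Code that assumes one orientation will get the wrong axis scaled whenever a sequence uses the other, which matches the "sometimes wrong depending on orientation" failure mode above.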
Assuming the camera metadata is loaded correctly, one thing that might be worth trying is setting `model.encoder.near_disparity` to a smaller value (you can find the default value of `3.0` in `config/model/encoder/epipolar.yaml`). A smaller value of this parameter sets the near plane further away: the value corresponds to the approximate distance (in screen widths) that a point on the near plane moves on the image plane when the camera is moved by the distance between the context views. The default value works well for the datasets we trained on, but it's possible that CO3D might work better with smaller values (i.e., near planes that are further away), since it's more object-centric than Real Estate 10k and ACID. In particular, a further-away near plane will likely concentrate more of the probability/bucket mass on the regions where the context views overlap.
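One way to read that definition in code (only a sketch with made-up inputs, not code from the repo; `baseline`, `fx_pixels`, and `image_width` are assumed quantities):

```python
def near_depth_from_disparity(near_disparity, baseline, fx_pixels, image_width):
    """Sketch: metric near-plane depth implied by a disparity budget.

    A point at depth z shifts by roughly baseline * fx_pixels / z pixels on
    the image plane when the camera translates by `baseline` between the
    context views; dividing by image_width expresses that shift in screen
    widths. Setting the shift equal to near_disparity and solving for z
    gives the near-plane depth.
    """
    return baseline * fx_pixels / (near_disparity * image_width)

# Halving near_disparity doubles the near-plane depth, i.e. pushes the near
# plane further from the camera (numbers are illustrative only).
assert near_depth_from_disparity(1.5, 0.5, 600.0, 800.0) == \
    2 * near_depth_from_disparity(3.0, 0.5, 600.0, 800.0)
```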
If you want a more general approach to setting the near plane, it might be worth using the point where the context views' frustums intersect (or some fraction of that distance, say 50%). If you don't care about having a general approach and just want CO3D to work, you can probably set the near plane to whatever the dataset provides, although you probably still want to keep a further-away far plane so the background is included.
Note that setting the far plane is easy: it's simply the depth at which the disparity becomes negligible (<0.5 pixels).
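That rule of thumb is easy to turn into code (again only a sketch; `baseline` and `fx_pixels` are assumed inputs, not repo parameters):

```python
def far_depth(baseline, fx_pixels, max_disparity_px=0.5):
    """Sketch: depth beyond which parallax between the context views drops
    below max_disparity_px pixels, using disparity_px ~ baseline * fx / z."""
    return baseline * fx_pixels / max_disparity_px

# With a 0.5 m baseline and a 600 px focal length, disparity falls below
# half a pixel beyond 600 m (illustrative numbers only).
assert far_depth(0.5, 600.0) == 600.0
```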
Solved! It turned out to be a camera intrinsics problem; really amazing work, thanks!
Hi @FantasticOven2 and @dcharatan,
Could you let me know how to load the intrinsic matrix for Co3D? I also visualized the epipolar lines and they seem to match as well.
I am currently loading using this function:
```python
# https://github.com/facebookresearch/pytorch3d/blob/main/pytorch3d/implicitron/dataset/frame_data.py#L708
def _get_pytorch3d_camera(
    entry,
) -> PerspectiveCameras:
    entry_viewpoint = entry.viewpoint
    assert entry_viewpoint is not None
    # principal point and focal length
    principal_point = torch.tensor(entry_viewpoint.principal_point, dtype=torch.float)
    focal_length = torch.tensor(entry_viewpoint.focal_length, dtype=torch.float)
    format = entry_viewpoint.intrinsics_format
    if entry_viewpoint.intrinsics_format == "ndc_norm_image_bounds":
        # legacy PyTorch3D NDC format
        # convert to pixels unequally and convert to ndc equally
        image_size_as_list = list(reversed(entry.image.size))
        image_size_wh = torch.tensor(image_size_as_list, dtype=torch.float)
        per_axis_scale = image_size_wh / image_size_wh.min()
        focal_length = focal_length * per_axis_scale
        principal_point = principal_point * per_axis_scale
    elif entry_viewpoint.intrinsics_format != "ndc_isotropic":
        raise ValueError(f"Unknown intrinsics format: {format}")
    return PerspectiveCameras(
        focal_length=focal_length[None],
        principal_point=principal_point[None],
        R=torch.tensor(entry_viewpoint.R, dtype=torch.float)[None],
        T=torch.tensor(entry_viewpoint.T, dtype=torch.float)[None],
    )
```
and converting them to OpenCV format using this:
```python
def _opencv_from_cameras_projection(
    cameras: PerspectiveCameras,
    image_size: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    R_pytorch3d = cameras.R.clone()  # pyre-ignore
    T_pytorch3d = cameras.T.clone()  # pyre-ignore
    focal_pytorch3d = cameras.focal_length
    p0_pytorch3d = cameras.principal_point
    T_pytorch3d[:, :2] *= -1
    R_pytorch3d[:, :, :2] *= -1
    tvec = T_pytorch3d
    R = R_pytorch3d.permute(0, 2, 1)
    # Retype the image_size correctly and flip to width, height.
    image_size_wh = image_size.to(R).flip(dims=(1,))
    # NDC to screen conversion.
    scale = image_size_wh.to(R).min(dim=1, keepdim=True)[0] / 2.0
    scale = scale.expand(-1, 2)
    c0 = image_size_wh / 2.0
    principal_point = -p0_pytorch3d * scale + c0
    # https://github.com/facebookresearch/co3d/issues/4#issuecomment-1952224331
    focal_length = focal_pytorch3d * scale
    camera_matrix = torch.zeros_like(R)
    camera_matrix[:, :2, 2] = principal_point
    camera_matrix[:, 2, 2] = 1.0
    camera_matrix[:, 0, 0] = focal_length[:, 0]
    camera_matrix[:, 1, 1] = focal_length[:, 1]
    return R, tvec, camera_matrix
```
And this is how I call these functions:

```python
def _process_intrinsic(x):
    pycamera = _get_pytorch3d_camera(x)
    h, w = x.image.size
    _, _, K = _opencv_from_cameras_projection(pycamera, torch.tensor(((h, w),)))
    K = K.squeeze(0)
    # K is normalized by (w, h) as per the repo
    K[0, :] /= w
    K[1, :] /= h
    return K
```
Further, these are the renderings and projections for overfitted examples trained for 7k iterations. The result still seems blurry and has some floaters. Is this expected, or am I making some mistake? I'm not sure.
Any help would be appreciated, Thanks!!
Hi @kevinYitshak, I'm not 100% sure, but I think what you got is correct; did you try training on the entire set of hydrant sequences?
Hi @FantasticOven2, I did try and here is an example result!!
Thanks @kevinYitshak! Can you also show the projection figures? In my case, I got correct RGB and depth but checkerboard-pattern point clouds.
Hi @FantasticOven2, These are the projections:
Also, I am setting the near and far planes as mentioned here: https://github.com/facebookresearch/co3d/issues/18#issuecomment-954768105
Also, what was the issue in your case?
Hi @FantasticOven2, Also how did you set your near and far planes?
Hey @kevinYitshak, sorry for the late reply. I used near=0.01 and far=100; I will try the near/far planes you used.
Hello,
First off, I want to express my gratitude for the remarkable work.
I've trained PixelSplat on the DTU dataset and attempted to visualize the point cloud results (generate_point_cloud_figure.py). However, I encountered an issue where the point clouds appear to be split into two distinct sets according to each camera's coordinates (right image).
This is in contrast to the results I obtained with the re10k dataset (left image), where the point clouds did not exhibit this separation. For reference, here are the visualizations:
The images above illustrate the differences in point cloud visualization outcomes between the two datasets. Could you provide any insights into what might be causing this discrepancy with the visualizations?
Thank you in advance for your assistance.