facebookresearch / PoseDiffusion

[ICCV 2023] PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment

conversion to colmap (nerf dataset) #27

Closed wooni-github closed 6 months ago

wooni-github commented 11 months ago

Hi everyone,

I attempted to convert the estimated poses to the COLMAP format for the NeRF dataset, following these. Here is my code.

import torch
from scipy.spatial.transform import Rotation

# PyTorch3D -> COLMAP/OpenCV convention: negate the first two columns of R
# and the first two entries of T
R_pytorch3d = pred_cameras.R
R_pytorch3d[:, :, :2] *= -1

T_pytorch3d = pred_cameras.T
T_pytorch3d[:, :2] *= -1

R_colmap = R_pytorch3d.transpose(-2, -1)  # swap the last two dimensions, i.e., 3x3
T_colmap = T_pytorch3d

for i in range(pred_cameras.R.shape[0]):
    # Rots = pred_cameras.R[i].cpu().tolist()
    Rots = R_colmap[i].cpu().tolist()

    # as_quat() returns [qx, qy, qz, qw]; COLMAP expects [qw, qx, qy, qz]
    quaternion = Rotation.from_matrix(Rots).as_quat()
    qw, qx, qy, qz = quaternion[3], quaternion[0], quaternion[1], quaternion[2]

    # Ts = pred_cameras.T[i].cpu().tolist()
    Ts = T_colmap[i].cpu().tolist()
    tx, ty, tz = Ts
    print(f'{qw} {qx} {qy} {qz} {tx} {ty} {tz}')

avg_x = torch.mean(pred_cameras.focal_length[:, 0])
avg_y = torch.mean(pred_cameras.focal_length[:, 1])
fx = avg_x.item()
fy = avg_y.item()
ff = (fx + fy) / 2.0
print("Average fx:", fx)
print("Average fy:", fy)

# Use one of the two (the second assignment overwrites the first):
answer = ' '.join(['SIMPLE_PINHOLE', str(1920), str(1080), str(ff), str(1920 // 2), str(1080 // 2)])
answer = ' '.join(['PINHOLE', str(1920), str(1080), str(fx), str(fy), str(1920 // 2), str(1080 // 2)])

I also used the SIMPLE_PINHOLE model. After converting the results to the COLMAP format (with cameras.bin, images.bin, points3D.bin), "colmap point_triangulator" fails: triangulation always fails and there are zero matches. What could be the issue?
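For reference, this is how the converted poses could be written out as a COLMAP *text* model (which `point_triangulator` also accepts via `--input_path`). This is only a minimal sketch: the file layout follows COLMAP's documented text format, `write_colmap_text_model` is a hypothetical helper name, and the values fed to it are placeholders from the conversion above.

```python
# Minimal sketch: write a COLMAP text model (cameras.txt / images.txt /
# points3D.txt) for colmap point_triangulator.
import os

def write_colmap_text_model(out_dir, width, height, f, poses):
    """poses: list of (qw, qx, qy, qz, tx, ty, tz, image_name) tuples.
    image_name must match the names registered in the COLMAP database."""
    os.makedirs(out_dir, exist_ok=True)

    # One shared SIMPLE_PINHOLE camera: params are f, cx, cy.
    with open(os.path.join(out_dir, "cameras.txt"), "w") as fp:
        fp.write(f"1 SIMPLE_PINHOLE {width} {height} {f} {width / 2} {height / 2}\n")

    # images.txt: each image takes TWO lines; the second (2D points) can stay
    # empty, since point_triangulator fills correspondences from the database.
    with open(os.path.join(out_dir, "images.txt"), "w") as fp:
        for i, (qw, qx, qy, qz, tx, ty, tz, name) in enumerate(poses, start=1):
            fp.write(f"{i} {qw} {qx} {qy} {qz} {tx} {ty} {tz} 1 {name}\n\n")

    # points3D.txt must exist but starts empty.
    open(os.path.join(out_dir, "points3D.txt"), "w").close()
```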

I referred to this and this to run the COLMAP commands. It worked when I used other methods, such as Charuco-based camera intrinsic and extrinsic calibration.

jytime commented 11 months ago

Hi,

Could you check the value of the focal length? Typically, COLMAP operates with focal lengths expressed in pixel units. For instance, for an image size of (1024, 1024), the focal length expected by COLMAP is usually around 1000 pixels. In contrast, our focal lengths are specified in PyTorch3D's NDC coordinates, which are much smaller, e.g., around 2. This discrepancy might be causing the triangulation issues in COLMAP. Please check the PyTorch3D documentation on camera handling at https://pytorch3d.org/docs/cameras to understand how to convert NDC focal lengths to pixel values.
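A minimal sketch of that conversion, assuming PyTorch3D's default NDC convention where lengths are normalized by half of the smaller image dimension (see the PyTorch3D cameras docs linked above):

```python
# Sketch: convert a PyTorch3D NDC focal length to pixel units, assuming the
# default NDC convention (normalization by half the min image dimension).
def ndc_focal_to_pixels(f_ndc, width, height):
    return f_ndc * min(width, height) / 2.0

# e.g. an NDC focal length of 2.0 on a 1920x1080 image:
fx_px = ndc_focal_to_pixels(2.0, 1920, 1080)  # -> 1080.0
```

Note how an NDC focal length of about 2 on a 1024x1024 image maps to roughly 1000 pixels, consistent with the magnitudes mentioned above.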

Best, Jianyuan

wooni-github commented 11 months ago

Thanks for your reply.

I converted the focal length to

ff = (1920**2 +1080**2)**0.5
# which is same to (w*w + h*h)**0.5

However, I suspect it still has a problem.

I am attaching an example image and COLMAP results.

This is an example image (0030).

These are the Diffusion pose results (extracted poses and focal length, principal points -> COLMAP with pre-computed poses):


And these are the COLMAP results (common usage, without any prior information such as intrinsic, extrinsic):

While the diffusion poses give a somewhat similar pose trajectory, they yield only a small number of triangulated points, and something else also seems wrong (the scale, etc.).

The Charuco board and chessboard are not related to this question. I only mentioned them because I tried converting the intrinsics and extrinsics obtained from them and applying those to COLMAP (which succeeded).

jytime commented 11 months ago

Hi,

Glad to see that it shows something now. Would you mind elaborating more on:

ff = (1920**2 +1080**2)**0.5
# which is the same as (w*w + h*h)**0.5

It looks like you are using a default focal length for all the images, which is common practice for initialization. However, the focal length should be consistent with R and t. For example, in COLMAP, the focal length is first initialized in a similar way to the method you used (they assume a default focal length of 1.2 x the image dimension), and then R, t, and FL are optimized together during bundle adjustment (BA). In our case, the focal length is also optimized together with R and t during GGS. If we replace the optimized focal length with a default one, it destroys the 3D structure.
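As a concrete illustration of how these two initialization heuristics differ numerically (assuming the 1.2 x max-dimension factor mentioned above, and keeping in mind that both are only starting points for BA, not substitutes for the optimized value):

```python
import math

w, h = 1920, 1080

# COLMAP-style default initialization (assumed 1.2 x max dimension heuristic)
f_colmap_init = 1.2 * max(w, h)    # 2304.0

# The image-diagonal heuristic used in the snippet above
f_diag = math.sqrt(w * w + h * h)  # ~2202.9
```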

Here I would suggest trying: (a) converting the FL predicted by PoseDiffusion to pixel units instead of using a default/assumed value, or (b) using COLMAP's BA code to further optimize the R, t, and FL you have at this stage. You may refer to pycolmap for convenience.

Besides, it is worth mentioning that, for the provided sample, we do not expect our method to reach a result as accurate as COLMAP's, because this case provides highly accurate and reliable matches (from the chessboard) for BA. However, based on my previous observations, PoseDiffusion should still give a reasonable result, i.e., we should be able to see the overall structure of the chessboard there, instead of the messy points shown now.