facebookresearch / PoseDiffusion

[ICCV 2023] PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment

about the coordinate system #9

Closed — hdzmtsssw closed this issue 1 year ago

hdzmtsssw commented 1 year ago

Hi there, thanks for your great work! I'm a bit confused about the coordinate system. Could you please explain how to transform the coordinate system to COLMAP style? It seems that the model coordinates in NDC are compressed. Also, I was wondering if the ground truth pose is necessary and, if so, which coordinate system it should follow. Additionally, what is the purpose of corresponding_cameras_alignment?

(Figure: cameras 0-29 are [pred_cameras.R, pred_cameras.T]; cameras 30-59 are the COLMAP images.)

jytime commented 1 year ago

Hi @hdzmtsssw ,

In the PyTorch3D NDC coordinate system, "+X points left, and +Y points up and +Z points out from the image plane". In COLMAP, "the X axis points to the right, the Y axis to the bottom, and the Z axis to the front as seen from the image." In addition, PyTorch3D multiplies points on the left of the matrix (row-vector convention, x_cam = x @ R + T), whereas COLMAP uses the column-vector convention (x_cam = R @ x + t), so we need to transpose the rotation matrix.

I will provide a transform function when we release the evaluation code. If you want to do it earlier, you can try something like the snippet below (I have not had time to check it carefully, so please verify it yourself):

# R_pytorch3d: (N, 3, 3), T_pytorch3d: (N, 3), from pred_cameras
R_pytorch3d[:, :, :2] *= -1                  # flip the X and Y axes (left/up -> right/down)
T_pytorch3d[:, :2] *= -1                     # apply the same flip to the translation
R_colmap = R_pytorch3d.transpose(-2, -1)     # transpose each 3x3 rotation (row- to column-vector convention)
T_colmap = T_pytorch3d
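
For the reverse direction (COLMAP/OpenCV back to PyTorch3D), the same two steps are undone in reverse order. This is only a sketch against the same batched tensors, equally unchecked:

# Sketch: COLMAP/OpenCV -> PyTorch3D, assuming (N, 3, 3) / (N, 3) tensors
R_pytorch3d = R_colmap.transpose(-2, -1).clone()  # back to the row-vector convention
T_pytorch3d = T_colmap.clone()
R_pytorch3d[:, :, :2] *= -1                       # flip the X and Y axes again
T_pytorch3d[:, :2] *= -1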

The ground truth pose is not necessary. We only use it to show how to compute the error metric.
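
For illustration only (this is not necessarily the repo's exact implementation), the absolute rotation error can be computed as the geodesic angle between the predicted and ground-truth rotations after alignment:

import torch

def rotation_angle_deg(R_pred, R_gt):
    # R_pred, R_gt: (N, 3, 3) rotation matrices in the same coordinate system.
    R_rel = R_pred.transpose(-2, -1) @ R_gt           # relative rotation
    trace = R_rel.diagonal(dim1=-2, dim2=-1).sum(-1)  # tr(R_rel) = 1 + 2 cos(theta)
    cos_theta = ((trace - 1.0) / 2.0).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos_theta))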

Actually, the model coordinates in NDC are not "compressed". From a set of images alone, the true metric scale of a scene cannot be recovered, so different methods simply work in their own normalised units.

corresponding_cameras_alignment is necessary if you want to compare two sets of cameras without knowing the scale of the scene. It estimates a single similarity transformation between two sets of cameras, cameras_src and cameras_tgt, and returns an aligned version of cameras_src. In other words, the function aligns one camera set to the other by estimating a global rotation, a global translation, and a scale.
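
A minimal usage sketch (gt_cameras is an illustrative name; both camera sets must already live in the same coordinate system):

from pytorch3d.ops import corresponding_cameras_alignment

pred_cameras_aligned = corresponding_cameras_alignment(
    cameras_src=pred_cameras,   # cameras to be aligned
    cameras_tgt=gt_cameras,     # reference cameras
    estimate_scale=True,        # also solve for a global scale
    mode="extrinsics",
)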

hdzmtsssw commented 1 year ago

@jytime Thanks for your fast reply! I applied the transformation you mentioned to the pred_cameras, and it looks good on the sample scene (apple). However, it did not produce good results on my own dataset, which consists of 30 forward-facing images (3840 × 2160). Does your method not support forward-facing scenes?

I plan to use your method to replace COLMAP for obtaining poses for NeRF training. Do I need to provide the ground truth pose from COLMAP to obtain the correct scale and an aligned version? If so, which coordinate system should the ground truth pose follow (PyTorch3D NDC?), and which transform function should it use (the same code you provided)?

Lastly, could you please let me know when the script for NeRF training will be released?

[Figure 1: apple - transformed]

[Figure 2: custom data - transformed]

[Figure 3: custom data - predict_cameras output (0-29) and COLMAP (30-59)]

jytime commented 1 year ago

Hi @hdzmtsssw ,

The result looks a bit weird. Forward-facing images should be okay. Did you use GGS for this result? Or can you share some of the images?

hdzmtsssw commented 1 year ago

Hi @jytime, I used the default configuration, so yes, GGS was used for this result.

Unfortunately, the dataset is confidential, so I cannot share the images. The camera array was fan-shaped and captured four people standing in front of a green screen, with varying intrinsics between the cameras.

In another scene captured using the same inward-facing camera array as before, the transformed predict_cameras appear to be outward-facing.

I also tried the LLFF forward-facing dataset, such as the fern scene, which appears to be correct(?), but it seems that it is not aligned. I attempted to align it, but the result does not match COLMAP.

[Fig 1: custom data 2 - transformed]

[Fig 2: fern - transformed]

[Fig 3: fern - COLMAP]

jytime commented 1 year ago

Hi @hdzmtsssw ,

I guess the inward-facing vs. outward-facing problem comes from the coordinate transform, e.g., whether the rotation matrix is transposed or not, or the xyz axis directions. I would suggest checking whether the coordinate-transform code works correctly.

Regarding "not aligned", how did you conduct camera alignment? I am a bit confused. For example, we can see the scales of Fig 2 and Fig3 are quite different.

There may be multiple solutions for alignment, e.g., (1) use corresponding_cameras_alignment, or (2) a simple and fast way: force the first camera in each camera set to be the origin and use the second camera to compute the alignment matrix (see the sketch below). The second solution may be quite inaccurate, but it can give you a quick sense of how the two sets compare. Please be aware that in either case the input and target cameras must be in the same coordinate system.
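
A rough numpy sketch of option (2); the names quick_align, c2w_src, and c2w_tgt are illustrative, not from the repo:

import numpy as np

def quick_align(c2w_src, c2w_tgt):
    # c2w_src, c2w_tgt: (N, 4, 4) camera-to-world matrices in the SAME coordinate system.
    # Re-express each set relative to its own first camera (first camera -> identity) ...
    rel_src = np.linalg.inv(c2w_src[0])[None] @ c2w_src
    rel_tgt = np.linalg.inv(c2w_tgt[0])[None] @ c2w_tgt
    # ... and use the distance to the second camera to estimate a global scale.
    scale = np.linalg.norm(rel_tgt[1, :3, 3]) / max(np.linalg.norm(rel_src[1, :3, 3]), 1e-8)
    rel_src[:, :3, 3] *= scale
    return rel_src, rel_tgt  # roughly comparable, up to a remaining global rotation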

By the way, if you use the "ground truth" camera poses from the LLFF dataset, please note that LLFF has its own coordinate system.

If you could provide a minimal, reproducible example of the code on LLFF, I'd be happy to take a look.

hdzmtsssw commented 1 year ago

Hi @jytime , thanks for your reply.

Here is the example code (ref: https://github.com/Fyusion/LLFF/blob/c6e27b1ee59cb18f054ccb0f87a90214dbe70482/llff/poses/pose_utils.py#L51C29-L51C34):

# transform.py: build gt_cameras.npz (left, up, forward) from NeRF poses (right, up, backward)
import numpy as np

pose = np.load('cams_meta.npy')
bottom = np.tile(np.array([0., 0., 0., 1.]).reshape(1, 1, -1), (pose.shape[0], 1, 1))
K = pose[:, 12:21].reshape(-1, 3, 3)
focal = np.concatenate([K[:, 0:1, 0], K[:, 1:2, 1]], -1)
c2w = pose[:, :12].reshape(-1, 3, 4)  # (x, y, z): (right, up, backward)
c2w_new = np.concatenate([-c2w[:, :, 0:1], c2w[:, :, 1:2], -c2w[:, :, 2:3], c2w[:, :, 3:]], 2)  # to (-x, y, -z): (left, up, forward), which LLFF used
c2w_new = np.concatenate([c2w_new, bottom], -2)  # append homogeneous row -> (N, 4, 4)
w2c = np.linalg.inv(c2w_new)
T = w2c[:, :3, 3]
R = np.transpose(w2c[:, :3, :3], (0, 2, 1))  # right mul to left mul?
np.savez('gt_cameras.npz', gtR=R, gtT=T, gtFL=focal)

# demo.py: save pred_cameras as w2c (I think so)
import os
import numpy as np

R = pred_cameras.R.cpu().numpy()
T = pred_cameras.T.unsqueeze(-1).cpu().numpy()
poses = np.concatenate([R, T], axis=-1)  # (N, 3, 4)
np.save(os.path.join(folder_path, "pred_cameras.npy"), poses)

# then use corresponding_cameras_alignment to align

Fern (log output): For samples/apple: the absolute rotation error is 13.783582 degrees.

For visualization, transform to c2w:

pose = np.load('pred_cameras.npy')
R = np.transpose(pose[:, :3, :3], (0, 2, 1))  # left mul to right mul?
pose[:, :3, :3] = R
bottom = np.tile(np.array([0., 0., 0., 1.]).reshape(1, 1, -1), (pose.shape[0], 1, 1))
w2c = np.concatenate([pose, bottom], -2)
c2w = np.linalg.inv(w2c)
c2w_new = np.concatenate([-c2w[:, :, 0:1], c2w[:, :, 1:2], -c2w[:, :, 2:3], c2w[:, :, 3:]], 2)  # back to (x, y, z)
...visualization...
jytime commented 1 year ago

Hi @hdzmtsssw ,

After a quick glance, I am a bit confused about the source of 'cams_meta.npy'. Does it contain cameras from COLMAP, LLFF, or some other source? Based on the comment c2w = pose[:, :12].reshape(-1, 3, 4) # (x, y, z): (right, up, backward), it looks like you assume it is "right, up, backward".

However, COLMAP uses "right, down, forwards", while LLFF uses "down, right, backwards". You can find the corresponding discussion in the LLFF repository.

Would this be a source of the problem?
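
For reference, a hedged sketch of the common OpenGL/NeRF (right, up, backward) to COLMAP (right, down, forward) conversion, which simply negates the Y and Z rotation columns of a camera-to-world matrix (function name is illustrative):

import numpy as np

def c2w_opengl_to_colmap(c2w):
    # c2w: (N, 4, 4) or (N, 3, 4) camera-to-world matrices in the OpenGL/NeRF convention.
    out = c2w.copy()
    out[:, :3, 1:3] *= -1  # flip the Y (up -> down) and Z (backward -> forward) camera axes
    return out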

hdzmtsssw commented 1 year ago

Hi @jytime, the coordinate system in 'cams_meta.npy' is the vanilla NeRF coordinate system (the "OpenGL coordinate system"). It is fine and can be obtained by converting from COLMAP or LLFF. ref: https://github.com/bmild/nerf#already-have-poses

jytime commented 1 year ago
[figure: fern - aligned visualisation]

Hi @hdzmtsssw ,

It looks like the main problem is that the camera motion for fern is very small, so the alignment needs to be more accurate. You can try something like this in your own code:

    from pytorch3d.ops import corresponding_cameras_alignment

    pred_cameras_aligned = corresponding_cameras_alignment(
        cameras_src=pred_cameras,
        cameras_tgt=colmap_cameras,
        estimate_scale=True,
        mode="extrinsics",
        eps=1e-9,  # small epsilon for numerical stability
    )

Or you can use my code to reproduce the visualisation above, which uses visdom.

Please note that in any case the alignment cannot be perfect, so the visualisation can only give you a sense of the overall structure.