crockwell / rel_pose

[3DV 2022] The 8-Point Algorithm as an Inductive Bias for Relative Pose Prediction by ViTs
BSD 3-Clause "New" or "Revised" License

Several Questions about the paper #9

Closed: qsisi closed this issue 5 months ago

qsisi commented 6 months ago

Hello! Thanks for open-sourcing this amazing work! Here I have some questions about the paper.

  1. I wonder how to get the actual U^T@U given the ground-truth rotation and translation, as shown in Figure 9 of the paper. In my understanding, constructing the actual U^T@U requires ground-truth pixel correspondences; did you use some sort of descriptor -> matching -> correspondence-filtering pipeline to get the actual pixel matches?
  2. I still don't understand how the network can learn to predict translation at actual scale. In my understanding, it is unlikely that relative translation can be predicted at metric scale when only 2D pixel information is given as the network's input. Could you elaborate on that? Thanks!

Looking forward to your reply!

crockwell commented 6 months ago

Hi, thanks for the kind words!

  1. Agreed, this is using ground truth. An easy method for finding true correspondences can be found in LoFTR's codebase, which, given ground-truth pose and depth, performs a mutual nearest-neighbor check to obtain true correspondences. From these we can construct U and then U^T@U (a small sketch of that construction is below).
  2. Good point: translation scale is ambiguous from correspondence information alone. By training an end-to-end model, though, we allow the network to learn to estimate scale from the input features. As with predicting metric depth or reconstructing 3D from 2D pixels, the learned network can do a decent job even though it does not actually have 3D information as input. By analogy, as humans we can estimate scale reasonably well from a single image, even without any explicit 3D information.
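
For concreteness, a minimal sketch of building U^T@U from ground-truth matches, following the classic 8-point epipolar-constraint rows. The function name and the exact normalization here are illustrative and may differ from what the paper's code actually does:

import torch

def build_UtU(pts1, pts2):
    # pts1, pts2: (N, 2) matched points in normalized image coordinates.
    x1, y1 = pts1[:, 0], pts1[:, 1]
    x2, y2 = pts2[:, 0], pts2[:, 1]
    ones = torch.ones_like(x1)
    # One epipolar-constraint row per match:
    # [x2*x1, x2*y1, x2, y2*x1, y2*y1, y2, x1, y1, 1]
    U = torch.stack([x2 * x1, x2 * y1, x2,
                     y2 * x1, y2 * y1, y2,
                     x1, y1, ones], dim=1)  # (N, 9)
    return U.T @ U  # (9, 9)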

Hope this helps! Chris

qsisi commented 6 months ago

Thanks for your clarification!

Also, I have a question on the positional encoding implemented here:

for j in range(h):
  for k in range(w):
    w1, w2, w3 = torch.split(Kinv @ torch.tensor([xs[k], ys[j], 1]), 1, dim=1)
    p3[:, int(k * w + j)] = w2.squeeze() / w3.squeeze() 
    p4[:, int(k * w + j)] = w1.squeeze() / w3.squeeze() 
  1. p3 and p4 correspond to x and y, right? So shouldn't it be p3 = w1 / w3 and p4 = w2 / w3?
  2. The 1D index should be row_index * width + col_index, right? So it should be int(j * w + k) instead of int(k * w + j)? Since the resnet features are flattened row by row as:
    x = self.extractor_final_conv(x) # 192, 24, 24 
    x = x.reshape([input_images.shape[0], -1, self.num_patches])

    so the positional encoding should be computed in the corresponding order? (A small flattening check is sketched below.)
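
As a quick illustration of the indexing question above (my own toy check, not code from the repo): a row-major reshape of an (h, w) grid places element (j, k) at 1D index j * w + k, while k * w + j addresses the transposed position.

import torch

# Toy check of the flattening order; h = w here, as in the 24 x 24 feature map.
h, w = 4, 4
feat = torch.arange(h * w).reshape(h, w)
flat = feat.reshape(-1)
j, k = 1, 2
assert flat[j * w + k] == feat[j, k]  # row-major index
assert flat[k * w + j] == feat[k, j]  # the other ordering indexes the transposed position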

Looking forward to your reply!

qsisi commented 6 months ago

Also, the positional encoding normalizes the image coordinates to [-1, 1] by dividing the intrinsics by hpix & wpix, which are cy * 2 and cx * 2 in the code. Why not just divide the intrinsics by the image height and width, given that cx & cy do not always equal width/2 & height/2 for most real-world cameras? :)
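
For reference, something like the following is what I mean (names are mine, just a sketch, not the repo's code):

def normalize_xy(x, y, width, height):
    # Map pixel coordinates to roughly [-1, 1]; dividing by 2*cx, 2*cy instead
    # only coincides with this when cx = width / 2 and cy = height / 2.
    return 2.0 * x / width - 1.0, 2.0 * y / height - 1.0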

qsisi commented 6 months ago

Sorry to bother you again, but has anyone found that the ground-truth extrinsics of Matterport are not accurate?

crockwell commented 6 months ago

Hi,

Not sure what you're referring to regarding the Matterport extrinsics being inaccurate. I believe there was a paper showing the dataset does not contain roll rotations, only pitch and yaw. I think this is a good motivation for us to evaluate on other datasets, to validate that the approach still works when faced with more general rotations.

I haven't dug into the positional encoding question, but I think dividing by the image height and width instead is a reasonable approach. The ordering you point out is interesting -- are you sure it's not equivalent? I visualized the positional encodings before and they made sense to me. Maybe a visual of what is going on could clarify this issue?
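
One possible way to make such a visual (a hypothetical sketch, not code from the repo): reshape one positional-encoding channel back to (h, w) and plot it; if the indexing were transposed, the gradient would run along the wrong axis.

import torch
import matplotlib.pyplot as plt

h, w = 24, 24
# Expected x-coordinate channel in row-major order: each row repeats the same ramp.
pe_x = torch.linspace(-1, 1, w).repeat(h)
plt.imshow(pe_x.reshape(h, w), cmap="viridis")
plt.title("positional-encoding channel reshaped to (h, w)")
plt.show()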

Thanks, Chris

qsisi commented 6 months ago

The extrinsics are accurate; it looks like I missed the flip_axis: https://github.com/crockwell/far/blob/main/mp3d_loftr/src/utils/dataset.py#L222, which I found in your FAR implementation. However, it seems you didn't add the flip_axis in this 8-Point ViT implementation? Is this a mistake?

crockwell commented 6 months ago

Ah gotcha -- this flip axis is to make the coordinate convention consistent with the Hartley and Zisserman coordinate system (which can be helpful if, e.g., starting from a pretrained network, or for more easily visualizing or extending to other datasets). While 8-Point ViT does not perform this and FAR does, the extrinsics are not incorrect in 8-Point ViT -- rather, it uses a different coordinate system.
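
For anyone converting between the two conventions, a generic sketch of an axis-flip change of basis looks like the following (illustrative only; the specific axes flipped in FAR may differ):

import numpy as np

# Flip the sign of chosen axes in both the world and camera frames.
S = np.diag([1.0, -1.0, -1.0])  # example: flip y and z

def flip_axes(R, t):
    # Same relative pose, expressed in the flipped coordinate convention.
    return S @ R @ S, S @ t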