**Closed** — qsisi closed this issue 5 months ago
Hi, thanks for the kind words!
Hope this helps! Chris
Thanks for your clarification!
Also, I have a question on the positional encoding implemented here:
```python
for j in range(h):
    for k in range(w):
        # back-project pixel (xs[k], ys[j]) through the inverse intrinsics
        w1, w2, w3 = torch.split(Kinv @ torch.tensor([xs[k], ys[j], 1]), 1, dim=1)
        # note the flattened index is k * w + j, i.e. column-major
        p3[:, int(k * w + j)] = w2.squeeze() / w3.squeeze()
        p4[:, int(k * w + j)] = w1.squeeze() / w3.squeeze()
```
while the extracted features are flattened in row-major patch order:

```python
x = self.extractor_final_conv(x)  # 192, 24, 24
x = x.reshape([input_images.shape[0], -1, self.num_patches])
```
So shouldn't the positional encoding be computed in the corresponding (row-major) order?
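To make the ordering concern concrete, here is a toy script (a made-up 4x4 grid; `row_major` / `col_major` are names I invented) showing that the two flattening orders are transposes of each other rather than equivalent:

```python
import numpy as np

h = w = 4  # square feature grid, standing in for the 24x24 ViT feature map
row_major = np.empty((h * w,), dtype=int)
col_major = np.empty((h * w,), dtype=int)
for j in range(h):
    for k in range(w):
        v = j * 10 + k  # encode (row, col) in the value so the layout is visible
        row_major[j * w + k] = v  # order produced by x.reshape(...)
        col_major[k * w + j] = v  # order used in the encoding loop above

# The two layouts are transposes of each other, not the same ordering:
assert (col_major.reshape(w, h).T.reshape(-1) == row_major).all()
assert not (col_major == row_major).all()
```

For a square grid the `k * w + j` indexing stays in bounds, so nothing crashes; the encodings just land on transposed patch positions relative to the row-major feature flattening.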
Looking forward to your reply!
Also, the positional encoding normalizes image coordinates to [-1, 1] by dividing the intrinsics by hpix and wpix, which are set to 2*cy and 2*cx in the code. Why not divide the intrinsics by the actual image height and width instead, given that (cx, cy) does not equal (width/2, height/2) for most real-world cameras? :)
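To illustrate what I mean, a toy example with made-up numbers (`W`, `H`, `cx` are hypothetical values, not from the repo):

```python
import numpy as np

# Toy camera whose principal point is NOT at the image center.
W, H = 640, 480
cx = 290.0  # off-center principal point (x only, for brevity)

u = np.array([0.0, cx, W])  # left edge, principal point, right edge

# Normalizing pixel coordinates by 2*cx (what dividing K by wpix = 2*cx does):
by_2cx = u / (2 * cx) * 2 - 1  # principal point -> 0, but right edge overshoots 1
# Normalizing by the true image width:
by_W = u / W * 2 - 1           # edges land exactly at -1 and +1
```

With `cx != W/2`, the first scheme centers the principal point but no longer maps the image borders to exactly [-1, 1]; the second does the opposite.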
Sorry to bother you again, but has anyone found that the ground-truth extrinsics of Matterport are inaccurate?
Hi,
Not sure what you're referring to regarding the Matterport extrinsics being inaccurate. I believe there was a paper showing the dataset has no roll rotations, only pitch and yaw. That is good motivation for us to evaluate on other datasets, to validate the approach still works when faced with more generic rotations.
The positional encoding question I haven't dug into, but dividing by image height and width instead seems like a reasonable approach. The ordering you point out is interesting -- are you sure it's not equivalent? I visualized the positional encodings before and they made sense to me. Maybe a visualization of what is going on could clarify this issue?
Thanks, Chris
The extrinsics are accurate; it looks like I missed the flip_axis: https://github.com/crockwell/far/blob/main/mp3d_loftr/src/utils/dataset.py#L222, which I found in your FAR implementation. However, it seems you didn't apply flip_axis in this 8-Point ViT implementation -- is that a mistake?
Ah, gotcha -- the flip_axis makes the coordinate convention consistent with the Hartley and Zisserman coordinate system (helpful if, e.g., starting from a pretrained network, or for more easily visualizing / extending to other datasets). While 8-Point ViT does not perform this flip and FAR does, the extrinsics are not incorrect in 8-Point ViT; it simply uses a different coordinate system.
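For readers curious what such a convention flip looks like, here is a minimal hedged sketch; the reflection axis and the helper name `flip_convention` are my own illustrative choices, not the repo's exact code:

```python
import numpy as np

# A reflection matrix that flips one axis (det = -1, and F @ F == I).
# Which axis gets flipped depends on the two conventions involved;
# y is chosen here purely for illustration.
F = np.diag([1.0, -1.0, 1.0])

def flip_convention(R, t):
    """Re-express a pose (R, t) from one coordinate convention in another."""
    return F @ R @ F, F @ t

R = np.eye(3)
t = np.array([1.0, 2.0, 3.0])
R2, t2 = flip_convention(R, t)
# Applying the flip twice recovers the original pose:
R3, t3 = flip_convention(R2, t2)
assert np.allclose(R3, R) and np.allclose(t3, t)
```

The key point is that such a flip is a consistent change of basis, not a correction: poses expressed in either convention describe the same geometry, which is why the 8-Point ViT extrinsics are valid without it.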
Hello! Thanks for open-sourcing this amazing work! I have some questions about the paper.
Looking forward to your reply!