QVPR / Patch-NetVLAD

Code for the CVPR2021 paper "Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition"
MIT License

Why choose homography? #66

Closed: setchin closed this issue 1 year ago

setchin commented 1 year ago

Thank you very much for your excellent work! I have a question about the spatial scoring method. The code uses a homography to estimate inlier points, but a homography is only valid when all the keypoints lie on the same plane, which is sometimes not the case. So I am wondering: why not use the fundamental matrix instead?

                H, mask = cv2.findHomography(index_keypoints, query_keypoints, cv2.RANSAC,
                                             ransacReprojThreshold=16*stride*1.5)
                # RANSAC reproj threshold is set to the (stride*1.5) in image space for vgg-16, given a particular patch stride
                # in this work, we ignore the H matrix output - but users of this code are welcome to utilise this for
                # pose estimation (something we may also investigate in future work)

Also, I would like to ask why the reprojection threshold is multiplied by 16. Thanks a lot in advance!

StephenHausler commented 1 year ago

Hi @setchin,

We decided to use a homography rather than the fundamental matrix because, in prototyping, we found that the homography gave slightly better results. In practice, it depends on the nature of the deployment environment, and planar surfaces are quite common in dense urban environments. If compute speed is not an issue, the ideal would probably be to estimate both the homography and the fundamental matrix and then pick the one with the most inliers.

The factor of 16 accounts for the downsampling from the original image size to the feature-map size of Patch-NetVLAD (due to the VGG-16 pooling layers). Because of this downsampling, the keypoints (in pixel coordinates) will always be at least 16 pixels apart from each other.
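The arithmetic can be sketched as follows. The helper names are hypothetical, and placing each keypoint at the top-left corner of its feature-map cell is a simplifying assumption for illustration; the repository computes patch centres in its own way.

```python
import math
from itertools import combinations

VGG16_DOWNSAMPLE = 16  # image-to-feature-map scale for VGG-16, as stated in the thread


def patch_keypoints_in_pixels(n_rows, n_cols, stride=1, downsample=VGG16_DOWNSAMPLE):
    """Hypothetical helper: (x, y) pixel positions of patches on a feature-map grid.

    Assumes each keypoint sits at the top-left corner of its cell.
    """
    step = downsample * stride
    return [(col * step, row * step) for row in range(n_rows) for col in range(n_cols)]


def ransac_reproj_threshold(stride=1, downsample=VGG16_DOWNSAMPLE, buffer=1.5):
    """Threshold in the same form as the quoted snippet: 16 * stride * 1.5."""
    return downsample * stride * buffer


keypoints = patch_keypoints_in_pixels(n_rows=3, n_cols=4, stride=1)

# Smallest distance between any two distinct keypoints, in pixels.
min_spacing = min(math.dist(a, b) for a, b in combinations(keypoints, 2))
threshold = ransac_reproj_threshold(stride=1)

print(min_spacing, threshold)
```

With a patch stride of 1, neighbouring keypoints are 16 pixels apart and the threshold is 24 pixels, i.e. one and a half grid steps.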

Hope this helps! Closing this issue now, but feel free to reopen if you have any further questions.

setchin commented 1 year ago

Thanks for the detailed explanation, it helps a lot!

divyagupta25 commented 1 year ago

Hey @StephenHausler, one question about the reprojection threshold: why is it stride*1.5? @setchin, in case you understood, could you please explain?

setchin commented 1 year ago

> Hey @StephenHausler, one question about the reprojection threshold: why is it stride*1.5? @setchin, in case you understood, could you please explain?

I assume it leaves some room for RANSAC error?

StephenHausler commented 1 year ago

Yes, pretty much. The challenge with Patch-NetVLAD is that the resolution of the local features is limited by the resolution of the feature maps, which means that a large threshold is needed. The minimum error can't really ever be less than 16 pixels. The 1.5x factor is a buffer that gives RANSAC some more 'room' to optimize.