Localization of real images with respect to a textured scan using QueryLocalizer

First of all, thanks for the great work.

I am trying to localize several real photos of a small object against a textured scan, and I'm having trouble making it work.

For my dataset I have:

a textured 3D mesh obtained from a scanner
rendered images of the scan
real images of the small object, for which I want to find poses.

My pipeline looks like this:

1) Get intrinsics for the real photos from COLMAP.
2) Set a virtual camera trajectory: 40 cameras are placed on a sphere, looking at the object.
3) Render images of the textured 3D mesh using the intrinsics (step 1) and extrinsics (step 2).
4) Build a PixSfM reconstruction from the rendered images (step 3) using SuperGlue+SuperPoint.
5) Extract features for the queries (real images of the small object) and match them with the references (step 4) using SuperGlue+SuperPoint.
6) Find poses for 70-100 real images (queries) using QueryLocalizer with the intrinsics from step 1.
7) Render query images with the poses found by QueryLocalizer.

Examples of the renders (step 7) look like this:

good localization:
bad localization:

I found that QueryLocalizer works when both the reference and the query images are rendered: it finds good poses for all query images. However, when I try to localize real query images against the model built from rendered images, it works poorly. Below you can see the result for the experiment with rendered reference and rendered query images (SuperGlue+SuperPoint). These images were obtained via visualization.visualize_loc_from_log from the hloc toolbox.

Screenshot from 2023-03-08 12-19-24

And here is the result when I try to localize real photos of small objects against the textured scan.

Screenshot from 2023-03-08 12-19-33

I also tried various configurations of feature extractors and matchers, specifically SuperGlue+SuperPoint, D2-Net+NN-Superpoint, R2D2+NN-Superpoint, SIFT+NN-Superpoint, SOSNet+NN-Superpoint, DISK+NN-Superpoint, etc. I expected SuperGlue+SuperPoint to give the best results, but somehow SIFT+NN-Superpoint worked better for my indoor dataset. Some experiments stopped at the stage of building the PixSfM reconstruction: triangulation failed and no reconstruction was created.

I tried to follow your suggestions here by checking some image pairs and their matches and validating that both references and queries have the same scale. Unfortunately, it did not help.

I also tried changing the "resize_max" parameter for different feature extractors; after doubling it, more query images were localized successfully.
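A minimal sketch of how such a change could be made without editing the stock configuration. The dictionary below is only assumed to mirror the layout of an entry in hloc's `extract_features.confs`; its values and the helper `with_resize_max` are illustrative, so check them against your installed hloc.

```python
import copy

# Assumed layout of an hloc feature-extraction config; values are illustrative.
base_conf = {
    "output": "feats-superpoint-n4096-r1024",
    "model": {"name": "superpoint", "nms_radius": 3, "max_keypoints": 4096},
    "preprocessing": {"grayscale": True, "resize_max": 1024},
}


def with_resize_max(conf, resize_max):
    """Return a copy of conf with a new resize_max and a distinct output name."""
    new_conf = copy.deepcopy(conf)         # leave the original config untouched
    new_conf["preprocessing"]["resize_max"] = resize_max
    # A different output name keeps features from the earlier resolution
    # from being mixed up with the new ones.
    new_conf["output"] = f"{conf['output']}-resize{resize_max}"
    return new_conf


doubled_conf = with_resize_max(base_conf, 2 * base_conf["preprocessing"]["resize_max"])
```

Using the same modified config for both the rendered references and the real queries keeps the two sets at a comparable scale.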

I also ran the same experiments on an indoor scene, but that did not work either.

I hope you can suggest something to make it work. I am looking forward to your reply!

Hi @makmary, it is great to see hloc+pixsfm being used in practice! And sorry for the late reply.

The objects in your scene have very little texture, so it is quite hard to detect repeatable keypoints across images. You could try semi-dense matching via hloc+loftr (see match_dense.py), which does not rely on detections but rather uses a grid. However, semi-dense matching requires quantizing the grid points after matching, which introduces a localization error in the keypoints and, in turn, noisy pose estimates. To fix this, you can run pixsfm-KA (keypoint adjustment) as a denoiser.
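To give a feel for the size of that quantization error, here is a toy illustration. It is not hloc's actual quantization scheme: the cell size of 8 px and the helpers `snap_to_grid` and `quantization_error` are assumptions made for the example.

```python
import math


def snap_to_grid(point, cell=8):
    """Snap an (x, y) location to the centre of the grid cell that contains it."""
    return tuple((math.floor(c / cell) + 0.5) * cell for c in point)


def quantization_error(point, cell=8):
    """Distance in pixels between a location and its snapped grid position."""
    qx, qy = snap_to_grid(point, cell)
    return math.hypot(point[0] - qx, point[1] - qy)


# A point on a cell corner is the worst case: half a cell off in both axes.
worst = quantization_error((16.0, 16.0))
# A point already at a cell centre is not moved at all.
best = quantization_error((20.0, 20.0))
```

In this toy setup the error ranges from zero up to half a cell diagonal, which is the kind of keypoint noise a refinement step such as keypoint adjustment is meant to reduce.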

Another note on deep image matching: Most current methods can only detect reliable matches up to 30-45° relative rotation between images. Therefore, if you have a prior on the orientation of the image (like the direction of gravity), aligning/rotating the images should greatly increase the number of correct matches.
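A rough sketch of the bookkeeping this involves, assuming the orientation prior tells you how many quarter-turns to apply: rotate the query, extract and match on the rotated image, then map the keypoints back to the original image before pose estimation. The helper names are hypothetical, and the mapping assumes that pixel (0, 0) has its centre at coordinate (0, 0); with a half-pixel-offset convention the constants shift by one.

```python
def rot90_ccw(image):
    """Rotate a row-major image (list of rows) by 90 degrees counter-clockwise."""
    width = len(image[0])
    return [[row[x] for row in image] for x in range(width - 1, -1, -1)]


def rotate_image(image, k):
    """Apply k counter-clockwise quarter-turns."""
    for _ in range(k % 4):
        image = rot90_ccw(image)
    return image


def keypoint_to_original(kpt, k, width, height):
    """Map (x, y) in the image rotated by k quarter-turns back to the original.

    width and height are those of the ORIGINAL, unrotated image.
    """
    x, y = kpt
    # Size of the rotated image in which the keypoint was detected.
    w_r, h_r = (width, height) if k % 2 == 0 else (height, width)
    for _ in range(k % 4):
        # Undo one counter-clockwise quarter-turn.
        x, y = h_r - 1 - y, x
        w_r, h_r = h_r, w_r
    return x, y


# Tiny 3x2 "image" with unique values to check the round trip.
image = [[0, 1, 2],
         [3, 4, 5]]
rotated = rotate_image(image, 1)
x, y = keypoint_to_original((1, 0), 1, width=3, height=2)
```

The intrinsics from COLMAP refer to the unrotated image, so mapping keypoints back before localization avoids having to rotate the camera model as well.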

Let me know if this solves your issue or if you have some more questions.

cvg / pixel-perfect-sfm
