Parskatt / DKM

[CVPR 2023] DKM: Dense Kernelized Feature Matching for Geometry Estimation
https://parskatt.github.io/DKM/

Questions about global matcher #36

Closed hanquansanren closed 1 year ago

hanquansanren commented 1 year ago

Hi Johan, Thanks for your great contribution,

I noticed that you used Gaussian processes to encode feature maps in the global matcher. I find this approach very novel and completely different from the global 4D correlation volume used in previous methods.

I was wondering: what motivated you to use Gaussian processes to model this, and why is a Gaussian process suitable for this warp-prediction problem?

Best wishes, Weiguang Zhang

Parskatt commented 1 year ago

Thanks for the kind comment :)

There are actually some similarities between 4D correlation volume processing and GPs: for example, the cross kernel K^{AB} is basically the 4D correlation volume with an exp on it. I think the main cool thing about GPs (and also why they're so expensive) is the (K^{BB})^{-1} term, which ensures that when many features are highly correlated they do not dominate the predictions. This is suitable for matching because we may have large regions that dominate, and we want to make sure they do not destroy the mapping for smaller regions. In principle you could achieve something similar with a very sharp kernel, but then you get instabilities and poor gradients.
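A toy NumPy sketch of this effect (not DKM's actual implementation; the RBF kernel, noise level, and coordinates below are illustrative assumptions). It compares the GP posterior mean, which contains the (K^{BB})^{-1} term, against a plain kernel-weighted average, after duplicating one support feature to mimic a large dominating region:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # exp(-||x - y||^2 / 2l^2): plays the role of the cross kernel K^{AB},
    # i.e. roughly "exp of a correlation volume" between feature sets.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior_mean(XA, XB, YB, noise=1e-2):
    # mu_A = K^{AB} (K^{BB} + sigma^2 I)^{-1} Y_B
    K_AB = rbf_kernel(XA, XB)
    K_BB = rbf_kernel(XB, XB)
    return K_AB @ np.linalg.solve(K_BB + noise * np.eye(len(XB)), YB)

def corr_weighted_mean(XA, XB, YB):
    # Baseline without the (K^{BB})^{-1} term: a plain kernel-weighted
    # average, where duplicated support features get extra votes.
    K_AB = rbf_kernel(XA, XB)
    return (K_AB / K_AB.sum(-1, keepdims=True)) @ YB

XA = np.array([[0.0, 0.0]])                            # one query feature
XB = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])   # support features
YB = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])   # their target coords

# Duplicate the first support feature ten times (a "dominating region").
XB_dup = np.concatenate([np.repeat(XB[:1], 10, 0), XB[1:]])
YB_dup = np.concatenate([np.repeat(YB[:1], 10, 0), YB[1:]])

gp0, gp1 = gp_posterior_mean(XA, XB, YB), gp_posterior_mean(XA, XB_dup, YB_dup)
cw0, cw1 = corr_weighted_mean(XA, XB, YB), corr_weighted_mean(XA, XB_dup, YB_dup)

print(np.abs(gp1 - gp0).max())  # small: the inverse downweights duplicates
print(np.abs(cw1 - cw0).max())  # large: duplicates dominate the average
```

The GP prediction barely moves when one feature is repeated, because (K^{BB} + σ²I)^{-1} spreads the influence of the duplicated rows across themselves, while the plain weighted average shifts strongly toward the repeated feature.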

GPs have some other downsides; in particular, they assume a unimodal output space, which is why coordinates need to be embedded into a space where this assumption makes sense.
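To illustrate why such an embedding helps (a toy sketch only; the frequency bank below is made up and is not DKM's actual coordinate embedding): with a cos/sin lift of 2-D coordinates, averaging the embeddings of two distant match hypotheses is no longer the same as embedding their midpoint, so a unimodal Gaussian posterior over embeddings need not collapse two modes onto a point between them.

```python
import numpy as np

def fourier_embed(coords, freqs):
    # Lift 2-D coordinates into cos/sin features. In this space, the mean
    # of two far-apart embeddings is NOT the embedding of the midpoint.
    proj = coords @ freqs.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

# Hypothetical fixed frequency bank (illustrative, not from the paper).
freqs = np.array([[1., 0.], [0., 1.], [2., 0.], [0., 2.],
                  [3., 3.], [4., -4.], [5., 1.], [1., 5.]])
a = np.array([[0.1, 0.1]])   # two candidate match locations
b = np.array([[0.9, 0.9]])

embed_of_mid = fourier_embed((a + b) / 2, freqs)
mid_of_embeds = (fourier_embed(a, freqs) + fourier_embed(b, freqs)) / 2
print(np.linalg.norm(mid_of_embeds - embed_of_mid))  # clearly nonzero
```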

Parskatt commented 1 year ago

I'm not a big fan of CNNs processing the correlation volume directly (for example by flattening the last two dimensions), for several reasons. One reason is that they rely a lot on locality, which we found tends to oversmooth the predictions. In DKM we still use a CNN decoder to get a warp from the embedding, which is not completely satisfactory.

In more recent work we actually use a Transformer (no position embeddings) on the posterior of the GP to produce global/coarse matches via regression by classification (preserving the multimodality in the embeddings). If you're interested, it's here: https://github.com/Parskatt/roma
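A minimal 1-D sketch of the regression-by-classification idea (generic, not the actual RoMa code; names and bin layout are mine): the model predicts a categorical distribution over coordinate bins, which can keep two match hypotheses alive, whereas the distribution's mean, what direct regression would output, averages them into a point between the modes.

```python
import numpy as np

def classify_then_regress(logits, grid):
    # Regression by classification: softmax over coordinate bins keeps
    # multimodality; the expectation (direct-regression analogue) does not.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p, float(p @ grid)

grid = np.linspace(-1.0, 1.0, 11)        # 1-D coordinate bins
logits = np.zeros(11)
logits[1] = logits[9] = 5.0              # two strong, symmetric hypotheses
p, mean = classify_then_regress(logits, grid)
print(grid[p.argmax()], mean)  # a real mode at -0.8, but the mean is ~0
```

Picking a mode (or classifying first, then regressing a refinement within the chosen bin) avoids committing to the meaningless averaged location.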

hanquansanren commented 1 year ago

Thank you for your kind answers. You're a genius for discovering the similarity. Actually, I'm new to research in this field, so I will try to understand this gradually.

I also noticed the RoMa work yesterday; I will continue to follow it!