facebookresearch / silk

SiLK (Simple Learned Keypoint) is a self-supervised deep learning keypoint model.
GNU General Public License v3.0

Question about supervision signal for Keypoint Head #21

Closed TuanTNG closed 1 year ago

TuanTNG commented 1 year ago

Hi,

Thank you for your excellent work.

I have some questions related to your work.

First, in your paper, you wrote about the Keypoint Head that "it is trained to identify keypoints with successful round-trip matches (defined by mutual-nearest-neighbor) among all others (unsuccessful)". I have some questions as follows:

  1. How can you ensure the "successful round-trip matches" are keypoints? Can you define a keypoint in this scenario? And why are "successful round-trip matches" keypoints?
  2. If all the pixels in 2 images have successful round-trip matches, are they all keypoints?
  3. At early training, if there is no "successful round-trip match", will there be no positive samples for Keypoint Head?
  4. Can you help explain the inference stage (e.g., the role of keypoint detection head, etc.). I cannot find it in your paper.
  5. By the way, besides accuracy and ease of training, can you point out some advantages of your work over SuperPoint? I see that VGGnp-4 runs at 12 FPS, which is much slower than SuperPoint's 70 FPS. I am looking forward to hearing from you soon.

Best regards, Tuan

gleize commented 1 year ago

Hi @TuanTNG,

  1. How can you ensure the "successful round-trip matches" are keypoints? Can you define a keypoint in this scenario? And why are "successful round-trip matches" keypoints?

In short, we essentially redefine keypoints from first principles (similar to DISK / GLAMpoints).

The original goal of keypoints is to be distinctive and robust to reasonable viewpoint / photometric changes, so that they can be tracked across multiple frames. Research focused on corners (e.g. Harris, SuperPoint) for a long time, since corners are known to have those properties.

In our work however, we focus on learning keypoints that have those properties directly, instead of relying on a proxy objective (i.e. learning "cornerness"). By measuring the round-trip success, we are essentially measuring the ability of a position to become a good keypoint (i.e. to have the two properties mentioned above). Descriptors that are not distinctive or not robust are unlikely to match correctly, so round-trip success is a good signal to regress the keypoint score on. By extending the definition of keypoints this way, we observe that our model can capture not only corners, but also more complex patterns (e.g. curves, complex textures, ...).
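For concreteness, here is a minimal sketch of the round-trip check, assuming the two views share a known per-position correspondence (position i in image A corresponds to position i in image B, as in a homography-warped pair) and L2-normalized descriptors. The function name and shapes are illustrative, not the actual SiLK code:

```python
import torch

def round_trip_success(desc_a, desc_b):
    # desc_a, desc_b: (N, D) descriptors at corresponding positions,
    # assumed L2-normalized, one row per pixel position.
    sim = desc_a @ desc_b.t()          # (N, N) cosine similarities
    nn_ab = sim.argmax(dim=1)          # nearest neighbor in B for each position in A
    nn_ba = sim.argmax(dim=0)          # nearest neighbor in A for each position in B
    # The round trip A -> B -> A succeeds when it returns to the start,
    # i.e. the match is a mutual nearest neighbor.
    idx = torch.arange(desc_a.shape[0], device=desc_a.device)
    return nn_ba[nn_ab] == idx         # (N,) boolean mask of successful positions
```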

  2. If all the pixels in 2 images have successful round-trip matches, are they all keypoints?

In theory yes, but that doesn't happen in practice. For example, images often contain large areas of uniform color or repetitive patterns. Given the local nature of keypoint descriptors, those regions do not contain enough information to obtain perfect matching.

  3. At early training, if there is no "successful round-trip match", will there be no positive samples for Keypoint Head?

Yes, as you said, there are no successful matches initially. So the keypoint head essentially converges towards outputting 0 everywhere, and the keypoint loss decreases accordingly (i.e. it's doing a good job at predicting that every keypoint will fail to match). However, after a short while, the descriptors start to become more discriminative, successful matches become more frequent, and the keypoint loss starts to increase until it stabilizes (i.e. learning which keypoints are likely to match becomes a harder problem to solve).
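To illustrate that dynamic, the keypoint head can be viewed as a per-position binary classifier regressed on the round-trip labels. This is a sketch under that framing (names are illustrative; the actual SiLK loss may differ in its details):

```python
import torch.nn.functional as F

def keypoint_loss(logits, success_mask):
    # logits: (N,) raw keypoint-head scores; success_mask: (N,) bool labels
    # from the round-trip check. Early in training success_mask is all False,
    # so the loss is minimized by pushing every score down; once descriptors
    # improve, positive labels appear and the problem becomes harder.
    return F.binary_cross_entropy_with_logits(logits, success_mask.float())
```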

  4. Can you help explain the inference stage (e.g., the role of keypoint detection head, etc.). I cannot find it in your paper.

The inference stage is fairly straightforward, and follows the standard "detect-and-describe" pattern. We first get the dense keypoint score output, then we select the top-k positions (i.e. the k best keypoints) from that dense map. Once we know the keypoint positions, we can extract the associated descriptors at those positions (from the dense descriptor map). So the output of the model is a set of positions, with associated descriptors, that can be used in the subsequent matching step.
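In code, that step looks roughly like the following (a sketch, not the exact SiLK API; the tensor shapes are assumptions):

```python
import torch

def detect_and_describe(score_map, desc_map, k=1000):
    # score_map: (H, W) dense keypoint scores; desc_map: (D, H, W) dense descriptors.
    H, W = score_map.shape
    topk = score_map.flatten().topk(k).indices            # k highest-scoring positions
    ys = torch.div(topk, W, rounding_mode="floor")        # back to 2D coordinates
    xs = topk % W
    descriptors = desc_map[:, ys, xs].t()                 # (k, D) descriptors at those positions
    positions = torch.stack([ys, xs], dim=1)              # (k, 2) keypoint coordinates
    return positions, descriptors
```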

  5. By the way, besides accuracy and ease of training, can you point out some advantages of your work over SuperPoint? I see that VGGnp-4 runs at 12 FPS, which is much slower than SuperPoint's 70 FPS.

It's difficult to get a fair comparison of FPS across papers since multiple factors affect speed (implementation quality, hardware, ...). Our released FPS numbers are meant to be compared relative to each other, to get a sense of the relative speed of the backbones, but they should not be taken as absolute, since speed is often a function of engineering effort (e.g. SiLK could be put on a chip and become orders of magnitude faster) and hardware.

That being said, I've just run a quick SiLK VGG-4 vs SuperPoint comparison to get some specific numbers. On 480x269 images (and a different machine than the one used in the paper), SuperPoint gets 83 FPS while SiLK gets 30 FPS. The large gap is explained by the lack of downsampling layers in SiLK. We don't consider that to be too bad, and it would likely benefit from further architectural investigation as future work.
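For reference, numbers like these can be reproduced with a simple timing loop; this is a rough sketch (hardware, warm-up, and the single-channel input shape are all assumptions, and absolute numbers will vary):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, height=269, width=480, n_runs=100, device="cuda"):
    x = torch.randn(1, 1, height, width, device=device)  # grayscale input assumed
    for _ in range(10):                                   # warm-up iterations
        model(x)
    torch.cuda.synchronize()                              # flush pending GPU work
    start = time.time()
    for _ in range(n_runs):
        model(x)
    torch.cuda.synchronize()
    return n_runs / (time.time() - start)
```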

Additionally, an interesting consequence of our architecture is the ability to become more accurate when using a smaller resolution (cf. supplementary Table 10), at an error threshold of 3 pixels. This shows that SiLK can still beat SuperPoint on the @3 metrics even when reducing the resolution by a factor of three. When doing so, SiLK runs at 68 FPS, which is a lot closer to SuperPoint's numbers.

I hope those answers help.

TuanTNG commented 1 year ago

Hi @gleize,

Thank you for your information. I will close the issue.