TODO: Here I will expand on the various methods, providing a brief description and highlighting the main differences, pros and cons of each of them.
I will start by analyzing SuperGlue, LoFTR, LightGlue and PATS.
SuperGlue
Learning Feature Matching with Graph Neural Networks
NN that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph NN.
SuperGlue is trained end-to-end on image pairs, allowing it to learn priors over geometric transformations and regularities of the 3D world directly from data.
We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems.
This formulation enforces the assignment structure of the predictions while enabling the cost to learn complex priors, elegantly handling occlusion and non-repeatable keypoints. Our method is trained end-to-end from image pairs – we learn priors for pose estimation from a large annotated dataset, enabling SuperGlue to reason about the 3D scene and the assignment.
Use of self- and cross-attention to leverage both the spatial relationships of the keypoints and their visual appearance
Trained end-to-end from image pairs, so as to learn priors for pose estimation from a large annotated dataset, enabling the network to reason about the 3D scene; combined with SuperPoint, this enables pose estimation
Flexible matching cost predicted by a deep NN
2D keypoints are usually projections of salient 3D points, like corners or blobs, thus correspondences across images must adhere to certain physical constraints:
1) A keypoint can have at most a single correspondence in the other image
2) Some keypoints will be unmatched due to occlusion and failure of the detector.
An effective model for feature matching should aim at finding all correspondences between reprojections of the same 3D points and identifying keypoints that have no matches.
SuperGlue is formulated to solve an optimization problem, whose cost is predicted by a deep NN. This alleviates the need for domain expertise and heuristics, as it learns relevant priors directly from the data.
Correspondences derive from a partial assignment between the two sets of keypoints. For the integration into downstream tasks and better interpretability, each possible correspondence should have a confidence value.
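These constraints can be written compactly; the following is a sketch of the standard partial-assignment formulation (M and N denote the numbers of keypoints in A and B, introduced here only for illustration):

```latex
% Soft partial assignment between M keypoints in image A and N keypoints in image B:
% each keypoint can be matched at most once in the other image.
P \in [0, 1]^{M \times N}, \qquad
P\,\mathbf{1}_N \le \mathbf{1}_M, \qquad
P^{\top}\mathbf{1}_M \le \mathbf{1}_N
```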
Attentional Graph Neural Network
Consider two images A and B, each with local features which are a set of keypoint positions p (x, y image coordinates and detection confidence c) and associated visual descriptors d (extracted by SuperPoint).
The first component uses a keypoint encoder to map keypoint positions p and their visual descriptors d into a single vector, and then uses alternating self- and cross-attention layers (repeated L times) to create more powerful representations f.
The Attentional Graph Neural Network block is responsible for computing matching descriptors by letting the features communicate with each other, alternating self- and cross-attention layers.
We embed the keypoint positions into a high-dimensional vector with a Multilayer Perceptron (MLP). This encoder enables the graph network to later reason about both appearance and position jointly, especially when combined with attention.
MLP: a feedforward NN of fully connected neurons with nonlinear activation functions, with at least three layers, able to handle data that is not linearly separable
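A minimal sketch of how such a keypoint encoder could look in PyTorch (layer sizes and the input normalization are assumptions, not the exact values of the released implementation):

```python
import torch
import torch.nn as nn

class KeypointEncoder(nn.Module):
    """Embed (x, y, confidence) into descriptor space and add it to the visual descriptor."""
    def __init__(self, desc_dim: int = 256, hidden: int = 32):
        super().__init__()
        # Small MLP: 3 inputs (x, y, c) -> desc_dim outputs.
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, desc_dim),
        )

    def forward(self, kpts: torch.Tensor, desc: torch.Tensor) -> torch.Tensor:
        # kpts: (N, 3) normalized keypoint positions + detection confidence
        # desc: (N, desc_dim) visual descriptors
        return desc + self.mlp(kpts)  # joint position + appearance embedding
```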
We consider a single complete graph whose nodes are the keypoints of both images. The graph has two types of undirected edges. Intra-image edges, or self edges, connect keypoints i to all other keypoints within the same image. Inter-image edges, or cross edges, connect keypoints i to all keypoints in the other image. We use the message passing formulation to propagate information along both types of edges. The resulting multiplex GNN starts with a high-dimensional state for each node and computes at each layer an updated representation by simultaneously aggregating messages across all given edges for all nodes.
The message is computed by an attention mechanism that performs the aggregation: it is a weighted average of the values.
Each layer has its own projection parameters, learned and shared for all keypoints of both images. In practice, multi-head attention improves the expressivity and gives maximum flexibility, as the network can learn to focus on a subset of keypoints based on specific attributes. SuperGlue can retrieve or attend based on both appearance and keypoint location, e.g. attending to a nearby keypoint or retrieving the relative positions of similar or salient keypoints. This enables representations of the geometric transformation and the assignment. The final matching descriptors f are linear projections of the final GNN states.
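A rough, untested sketch of one such message-passing layer with a single attention head (the real network alternates self- and cross-attention and uses 4 heads; names and shapes are assumptions):

```python
import torch
import torch.nn as nn

class AttentionalPropagation(nn.Module):
    """One message-passing layer: attention aggregation followed by an MLP update."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.update = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))

    def forward(self, x: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # x:      (N, dim) states of the keypoints being updated
        # source: (M, dim) states attended to; source = x for self-attention,
        #         the other image's keypoints for cross-attention
        q, k, v = self.q_proj(x), self.k_proj(source), self.v_proj(source)
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)  # (N, M) weights
        message = attn @ v                                            # weighted average of values
        return x + self.update(torch.cat([x, message], dim=-1))       # residual update
```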
Optimal Matching Layer
The optimal matching layer produces a partial assignment matrix.
The assignment P can be obtained by computing a score matrix S for all possible matches and maximizing the total score.
The pairwise score is expressed as the similarity of the matching descriptors.
The matching descriptors are not normalized, and their magnitude can change per feature and during training to reflect the prediction confidence.
To let the network suppress some keypoints, we augment each set with a dustbin so that unmatched keypoints are explicitly assigned to it.
Dustbins have also been used by SuperPoint to account for image cells that might not have a detection. We augment the scores S to S' by appending a new row and column, the point-to-bin and bin-to-bin scores, filled with a single learnable parameter.
Each dustbin has as many matches as there are keypoints in the other set.
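A hedged sketch of the score matrix and the dustbin augmentation described above (dot-product similarity and a single learnable scalar for the bins; function name and shapes are assumptions):

```python
import torch
import torch.nn as nn

def augmented_scores(f_a: torch.Tensor, f_b: torch.Tensor, bin_score: nn.Parameter) -> torch.Tensor:
    """Build the (M+1) x (N+1) augmented score matrix S' from matching descriptors.

    f_a: (M, D) matching descriptors of image A
    f_b: (N, D) matching descriptors of image B
    bin_score: single learnable scalar used for all point-to-bin and bin-to-bin entries
    """
    scores = f_a @ f_b.t()                     # (M, N) pairwise similarities
    m, n = scores.shape
    bins_col = bin_score.expand(m, 1)          # point-to-bin scores for image A
    bins_row = bin_score.expand(1, n + 1)      # point-to-bin + bin-to-bin scores for image B
    return torch.cat([torch.cat([scores, bins_col], dim=1), bins_row], dim=0)  # (M+1, N+1)
```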
Sinkhorn Algorithm: The solution of the above optimization problem corresponds to the optimal transport between discrete distributions a and b with scores S'. It is computed with the Sinkhorn algorithm, a differentiable version of the Hungarian algorithm classically used for bipartite matching, which consists in iteratively normalizing exp(S') along rows and columns, similar to row and column softmax.
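A minimal, untested sketch of the Sinkhorn normalization in the log domain; for simplicity each row and column is normalized to sum to one, whereas the paper uses marginals a and b that give each dustbin as much mass as there are keypoints in the other image:

```python
import torch

def log_sinkhorn(scores: torch.Tensor, iters: int = 100) -> torch.Tensor:
    """Iteratively normalize exp(scores) along rows and columns, in log space for stability.

    scores: (M+1, N+1) augmented score matrix S'
    returns: log of the (soft) partial assignment matrix
    """
    log_p = scores.clone()
    for _ in range(iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # row normalization (softmax-like)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # column normalization
    return log_p
```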
Loss function
Both the graph neural network and the optimal matching layer are differentiable – this enables back-propagation from matches to visual descriptors.
SuperGlue is trained in a supervised manner from ground truth matches, estimated from ground truth relative transformations – using poses and depth maps or homographies.
This also lets us label some keypoints as unmatched if they do not have any reprojection in their vicinity.
This supervision aims at simultaneously maximizing the precision and the recall of the matching.
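A hedged sketch of such a loss: the negative log-likelihood of the assignment over ground-truth matches and ground-truth unmatched keypoints (the index sets and the log-assignment input follow the description above; names are assumptions):

```python
import torch

def matching_loss(log_p: torch.Tensor, gt_matches: torch.Tensor,
                  unmatched_a: torch.Tensor, unmatched_b: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the predicted assignment.

    log_p:       (M+1, N+1) log partial assignment (last row/column are the dustbins)
    gt_matches:  (K, 2) indices (i, j) of ground-truth correspondences
    unmatched_a: indices of keypoints in A labeled as unmatched (assigned to the dustbin column)
    unmatched_b: indices of keypoints in B labeled as unmatched (assigned to the dustbin row)
    """
    nll = -log_p[gt_matches[:, 0], gt_matches[:, 1]].sum()
    nll = nll - log_p[unmatched_a, -1].sum()   # A keypoints sent to B's dustbin
    nll = nll - log_p[-1, unmatched_b].sum()   # B keypoints sent to A's dustbin
    return nll / (gt_matches.shape[0] + unmatched_a.numel() + unmatched_b.numel())
```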
Comparisons to related work
The SuperGlue architecture is equivariant to permutation of the keypoints within an image. Unlike other handcrafted or learned approaches, it is also equivariant to permutation of the images, which better reflects the symmetry of the problem and provides a beneficial inductive bias. Also, the optimal transport formulation enforces reciprocity of the matches.
Attention, as used by SuperGlue, is a flexible and powerful context aggregation mechanism that allows the network to weight keypoints adaptively. SuperGlue can jointly reason about appearance and position; it only needs local features, learned or handcrafted, and can thus be a simple drop-in replacement for existing matchers. SuperGlue borrows the self-attention from the Transformer but embeds it into a graph neural network, and additionally introduces cross-attention, which is symmetric. This simplifies the architecture and results in better feature reuse across layers.
Implementation Details:
SuperGlue can be combined with any local feature detector and descriptor but works particularly well with SuperPoint, which produces repeatable and sparse keypoints. Visual descriptors are bilinearly sampled from the semi-dense feature map.
All intermediate representations (key, query, value, descriptors) have the same dimension D = 256 as the SuperPoint descriptors. L=9 layers of alternating multi-head self- and cross-attention with 4 heads each, T=100 Sinkhorn iterations.
A forward pass takes on average 69 ms (15 FPS) for an indoor image pair.
To allow for data augmentation, SuperPoint detect and describe steps are performed on-the-fly as batches during training. A number of random keypoints are further added for efficient batching and increased robustness.
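For reference, the stated hyperparameters as a small configuration sketch (the key names are illustrative; only the values come from the notes above):

```python
# Hypothetical configuration dictionary; keys are assumptions, values are from the notes above.
superglue_config = {
    "descriptor_dim": 256,       # D, same as the SuperPoint descriptor dimension
    "gnn_layers": 9,             # L, alternating self- and cross-attention layers
    "attention_heads": 4,        # multi-head attention
    "sinkhorn_iterations": 100,  # T
}
```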
Experimental results:
All learned methods, including SuperGlue, are trained on ground truth correspondences, found by projecting keypoints from one image to the other. We generate homographies and photometric distortions on-the-fly.
Match precision (P) and recall (R) are computed from the ground truth correspondences. Homography estimation is performed both with RANSAC and with DLT (Direct Linear Transformation). SuperGlue is sufficiently expressive to master homographies, achieving 98% recall and high precision.
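As a side note, a minimal sketch of estimating a homography from matched keypoints with OpenCV, either robustly (RANSAC) or with a plain least-squares DLT fit (the point arrays and threshold are placeholders):

```python
import cv2
import numpy as np

# mkpts_a, mkpts_b: (K, 2) arrays of matched keypoint coordinates (placeholders for real matches)
mkpts_a = np.random.rand(20, 2).astype(np.float32) * 640
mkpts_b = mkpts_a + 5.0  # synthetic translation, just to make the example runnable

# Robust estimation: RANSAC with a reprojection threshold in pixels
H_ransac, inliers = cv2.findHomography(mkpts_a, mkpts_b, cv2.RANSAC, 3.0)

# Non-robust estimation: plain DLT / least-squares over all correspondences (method=0)
H_dlt, _ = cv2.findHomography(mkpts_a, mkpts_b, 0)
```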
SuperPoint and SuperGlue complement each other well since repeatable keypoints make it possible to estimate a larger number of correct matches even in very challenging situations.
Visualization of how the attention layers work
Strengths
End-to-end learning
Unified architecture that simultaneously performs context aggregation, matching and filtering
Self-attention boosts the receptive field of local descriptors, while cross-attention enables cross-image communication; it is inspired by the way humans look back-and-forth when matching images (does this count as bio-inspired?)
Utilizes a GNN to aggregate context both within and across images, which enhances the robustness and accuracy of feature matching
Global optimization ensured by the Sinkhorn algorithm, which handles partial assignments and occluded points
The partial assignment matrix with dustbins provides a way to filter out uncertain keypoints
Can be integrated into SfM and SLAM systems
Limitations
GNNs and attention mechanisms can be computationally complex
Check out LightGlue to see the improvements over this method