MaskFormer
Instead of doing per-pixel classification for segmentation, they propose mask classification. It works by predicting the segments in the image as binary masks and then predicting a class label for each mask. Their method is easily applicable to both semantic and instance segmentation tasks.
They use a Transformer decoder block to encode global information from the input features with N learned positional encodings (queries), where N is the number of segments. The N outputs are then fed to an MLP that produces the class prediction for each segment, as sketched below.
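To make this concrete, here is a minimal PyTorch sketch of a mask-classification head in this style. The layer sizes, the extra "no object" class, and the einsum-based mask prediction are my assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MaskClassificationHead(nn.Module):
    """Toy sketch of a MaskFormer-style head: N learned queries -> class logits + binary masks."""

    def __init__(self, num_queries=100, d_model=256, num_classes=80):
        super().__init__()
        # N learned query embeddings (the learned positional encodings for the segments)
        self.queries = nn.Embedding(num_queries, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        # Class head (+1 for a "no object" class) and a small MLP for mask embeddings
        self.class_head = nn.Linear(d_model, num_classes + 1)
        self.mask_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, image_features, pixel_features):
        # image_features: (B, HW, d_model) flattened backbone features for the decoder
        # pixel_features: (B, d_model, H, W) per-pixel embeddings for mask prediction
        B = image_features.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)   # (B, N, d_model)
        segment_embeds = self.decoder(q, image_features)          # (B, N, d_model)
        class_logits = self.class_head(segment_embeds)            # (B, N, num_classes + 1)
        mask_embeds = self.mask_head(segment_embeds)              # (B, N, d_model)
        # Dot product of each segment embedding with every pixel embedding -> N mask logit maps
        mask_logits = torch.einsum("bnd,bdhw->bnhw", mask_embeds, pixel_features)
        return class_logits, mask_logits
```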
A deeper dive would be required to find ways to improve research in this area.
Notes:
Graphormer
The authors adapt the Transformer to graphs by encoding their structural information. The encodings are as follows:
An interesting note is that the spatial and edge encodings are injected as bias terms in the attention rather than being added to the node embeddings. I think this allows the node embeddings themselves to remain more expressive; a sketch of the idea follows.
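A minimal sketch of the spatial bias, assuming a single attention head and only the shortest-path-distance encoding (names and dimensions are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class BiasedGraphAttention(nn.Module):
    """Toy sketch of Graphormer-style attention: learned scalar biases, indexed by
    shortest-path distance, are added to the attention logits before the softmax."""

    def __init__(self, d_model=64, max_dist=10):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One learnable bias per discretised shortest-path distance (the "spatial encoding")
        self.spatial_bias = nn.Embedding(max_dist + 1, 1)
        self.scale = d_model ** -0.5

    def forward(self, x, spd):
        # x:   (B, N, d_model) node features
        # spd: (B, N, N) integer shortest-path distances, clamped to max_dist
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, N, N) attention logits
        attn = attn + self.spatial_bias(spd).squeeze(-1)     # add the structural bias
        attn = attn.softmax(dim=-1)
        return self.out(attn @ v)
```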
TrackFormer
The main takeaways are as follows:
Limitations:
I like the simplicity of their method. They only alter the Q, K, and V that are fed to the attention mechanisms in both the encoder and decoder blocks. This simplicity, however, makes me think that a more surgical approach to the attention mechanism may yield more favourable results.
I found the problem of mapping object detections and segmentations across video frames intriguing. Perhaps a direction I will continue looking into for my first project.
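As a rough sketch of how I read the track-query idea (names and shapes are my own, not the authors' code): decoder outputs that matched an object in the previous frame are carried over and concatenated with the fresh object queries, so identity propagates purely through the attention inputs.

```python
import torch
import torch.nn as nn

class TrackQueryDecoder(nn.Module):
    """Toy sketch: previous-frame track queries are concatenated with new object queries."""

    def __init__(self, d_model=256, num_object_queries=100):
        super().__init__()
        self.object_queries = nn.Embedding(num_object_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)

    def forward(self, frame_features, track_queries=None):
        # frame_features: (B, HW, d_model) encoder output for the current frame
        # track_queries:  (B, T, d_model) matched decoder outputs from the previous frame, or None
        B = frame_features.size(0)
        obj_q = self.object_queries.weight.unsqueeze(0).expand(B, -1, -1)
        queries = obj_q if track_queries is None else torch.cat([track_queries, obj_q], dim=1)
        outputs = self.decoder(queries, frame_features)   # (B, T + N, d_model)
        # The first T outputs keep their identities from the previous frame;
        # the remaining N outputs can initialise new tracks.
        return outputs
```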
SuperGlue
Another great paper! Their method uses keypoint positions and keypoint descriptors from an existing front-end (e.g. SuperPoint). The important and relevant steps of the architecture are described using its two-module structure:
Attention Graph Neural Network:
Optimal Matching Layer:
Their method performed well empirically on homography estimation and pose estimation when paired with a front-end keypoint detector and descriptor.
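To make the Optimal Matching Layer concrete, here is a minimal sketch of differentiable matching via Sinkhorn iterations in log space. The function name, shapes, and iteration count are my assumptions; the dustbin construction and the descriptor-similarity scores are left out.

```python
import torch

def sinkhorn_matching(scores, num_iters=50):
    """Toy sketch of a differentiable optimal matching step via Sinkhorn iterations.

    scores: (M, N) pairwise matching scores between the keypoints of two images,
            assumed to already include a dustbin row/column for unmatched points.
    Returns a soft assignment matrix whose rows and columns approximately sum to 1.
    """
    log_p = scores
    for _ in range(num_iters):
        # Alternately normalise rows and columns in log space for numerical stability
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
    return log_p.exp()

# Usage (my assumption of the surrounding pipeline): build scores from descriptor
# similarities, append a learned dustbin score, run sinkhorn_matching, then keep
# mutual-argmax matches above a confidence threshold.
```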
Questions:
Read these papers: