MaskFormer
Instead of doing per-pixel classification for segmentation, they propose mask classification. It works by predicting the segments in the image as binary masks and then predicting a class label for each mask. Their method is easily applicable to both semantic and instance segmentation tasks.
They use a Transformer decoder block to encode global information from the input features with N learned positional encodings (queries), where N is the number of segments. The N outputs are then fed to an MLP that produces the class prediction for each segment, as sketched below.
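To make this concrete, here is a minimal PyTorch sketch of a mask-classification head in this style. The layer sizes, the extra "no object" class, and the einsum-based mask prediction are my assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MaskClassificationHead(nn.Module):
    """Toy sketch of a MaskFormer-style head: N learned queries -> class logits + binary masks."""

    def __init__(self, num_queries=100, d_model=256, num_classes=80):
        super().__init__()
        # N learned query embeddings (the learned positional encodings for the segments)
        self.queries = nn.Embedding(num_queries, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        # Class head (+1 for a "no object" class) and a small MLP for mask embeddings
        self.class_head = nn.Linear(d_model, num_classes + 1)
        self.mask_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, image_features, pixel_features):
        # image_features: (B, HW, d_model) flattened backbone features for the decoder
        # pixel_features: (B, d_model, H, W) per-pixel embeddings for mask prediction
        B = image_features.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)   # (B, N, d_model)
        segment_embeds = self.decoder(q, image_features)          # (B, N, d_model)
        class_logits = self.class_head(segment_embeds)            # (B, N, num_classes + 1)
        mask_embeds = self.mask_head(segment_embeds)              # (B, N, d_model)
        # Dot product of each segment embedding with every pixel embedding -> N mask logit maps
        mask_logits = torch.einsum("bnd,bdhw->bnhw", mask_embeds, pixel_features)
        return class_logits, mask_logits
```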
A deeper dive would be required to find ways to improve research in this area.
Notes:
Graphormer
The authors adapt the Transformer to graphs by encoding their structural information. The encodings are as follows:
An interesting note is that the spatial and edge encodings are injected as bias terms in the attention rather than being added to the node embeddings. I think this allows the node embeddings themselves to remain more expressive; a sketch of the idea follows.
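A minimal sketch of the spatial bias, assuming a single attention head and only the shortest-path-distance encoding (names and dimensions are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class BiasedGraphAttention(nn.Module):
    """Toy sketch of Graphormer-style attention: learned scalar biases, indexed by
    shortest-path distance, are added to the attention logits before the softmax."""

    def __init__(self, d_model=64, max_dist=10):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One learnable bias per discretised shortest-path distance (the "spatial encoding")
        self.spatial_bias = nn.Embedding(max_dist + 1, 1)
        self.scale = d_model ** -0.5

    def forward(self, x, spd):
        # x:   (B, N, d_model) node features
        # spd: (B, N, N) integer shortest-path distances, clamped to max_dist
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, N, N) attention logits
        attn = attn + self.spatial_bias(spd).squeeze(-1)     # add the structural bias
        attn = attn.softmax(dim=-1)
        return self.out(attn @ v)
```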
TrackFormer
The main takeaways are as follows:
Limitations:
I like the simplicity of their method. They only alter the Q, K, and V that are fed to the attention mechanisms in both the encoder and decoder blocks. This simplicity, however, makes me think that a more surgical approach to the attention mechanism may yield more favourable results.
I found the problem of mapping object detections and segmentations across video frames intriguing. Perhaps a direction I will continue looking into for my first project.
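As a rough sketch of how I read the track-query idea (names and shapes are my own, not the authors' code): decoder outputs that matched an object in the previous frame are carried over and concatenated with the fresh object queries, so identity propagates purely through the attention inputs.

```python
import torch
import torch.nn as nn

class TrackQueryDecoder(nn.Module):
    """Toy sketch: previous-frame track queries are concatenated with new object queries."""

    def __init__(self, d_model=256, num_object_queries=100):
        super().__init__()
        self.object_queries = nn.Embedding(num_object_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)

    def forward(self, frame_features, track_queries=None):
        # frame_features: (B, HW, d_model) encoder output for the current frame
        # track_queries:  (B, T, d_model) matched decoder outputs from the previous frame, or None
        B = frame_features.size(0)
        obj_q = self.object_queries.weight.unsqueeze(0).expand(B, -1, -1)
        queries = obj_q if track_queries is None else torch.cat([track_queries, obj_q], dim=1)
        outputs = self.decoder(queries, frame_features)   # (B, T + N, d_model)
        # The first T outputs keep their identities from the previous frame;
        # the remaining N outputs can initialise new tracks.
        return outputs
```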
SuperGlue
Another great paper! Their method uses keypoint positions and keypoint descriptors from an existing front-end (e.g. SuperPoint). The important and relevant steps of the architecture are described using its two-module structure:
Attention Graph Neural Network:
Optimal Matching Layer:
Their method performed well empirically on homography estimation and pose estimation when paired with a front-end keypoint detector and descriptor.
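To make the Optimal Matching Layer concrete, here is a minimal sketch of differentiable matching via Sinkhorn iterations in log space. The function name, shapes, and iteration count are my assumptions; the dustbin construction and the descriptor-similarity scores are left out.

```python
import torch

def sinkhorn_matching(scores, num_iters=50):
    """Toy sketch of a differentiable optimal matching step via Sinkhorn iterations.

    scores: (M, N) pairwise matching scores between the keypoints of two images,
            assumed to already include a dustbin row/column for unmatched points.
    Returns a soft assignment matrix whose rows and columns approximately sum to 1.
    """
    log_p = scores
    for _ in range(num_iters):
        # Alternately normalise rows and columns in log space for numerical stability
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
    return log_p.exp()

# Usage (my assumption of the surrounding pipeline): build scores from descriptor
# similarities, append a learned dustbin score, run sinkhorn_matching, then keep
# mutual-argmax matches above a confidence threshold.
```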
Questions:
Read these papers: