ChrisFuscoMasters / TransformerLib

Facilitates my research

Read & summarise related works #16

Closed CJcool06 closed 1 year ago

CJcool06 commented 1 year ago

Jack would like me to read and summarise these papers:

Additionally, I think the following papers would also be beneficial:

CJcool06 commented 1 year ago

DETR

A vanilla Transformer architecture is used: a CNN backbone produces features, and the features (with position encoding) are passed to an encoder stack. The encoder output is given as keys and values to the decoder stack, whose queries are learned positional embeddings (called object queries). The decoder output is passed to an FFN that predicts a class and a bounding box (centre position, height, width).
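
A minimal sketch of that pipeline, assuming PyTorch and illustrative shapes (the module names, query count, and the patch-conv stand-in for the CNN backbone are my own, not the paper's code):

```python
import torch
import torch.nn as nn

class DETRSketch(nn.Module):
    def __init__(self, num_classes, num_queries=100, d_model=256):
        super().__init__()
        # Stand-in for a CNN backbone (a real DETR uses e.g. ResNet-50).
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.transformer = nn.Transformer(d_model, nhead=8, batch_first=True)
        # Learned positional embeddings used as decoder queries ("object queries").
        self.object_queries = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                  # (cx, cy, h, w)

    def forward(self, images):
        feats = self.backbone(images)            # (B, C, H, W)
        B = feats.size(0)
        src = feats.flatten(2).transpose(1, 2)   # (B, HW, C); position encoding omitted here
        queries = self.object_queries.weight.unsqueeze(0).expand(B, -1, -1)
        hs = self.transformer(src, queries)      # decoder attends to encoder memory
        return self.class_head(hs), self.box_head(hs).sigmoid()

preds_cls, preds_box = DETRSketch(num_classes=91)(torch.randn(2, 3, 256, 256))
```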

What's interesting is that the final block in the encoder has an attention map that looks like a segmentation mask for each pixel. I have an idea that can leverage this.

Another interesting note is that the decoder was shown to learn, as one of its tasks, to remove duplicate predictions (removing the need for non-maximum suppression). I have an idea about how to remove this dependency.

All in all, awesome paper. I see lots of recent works are leveraging this architecture and achieving great results on benchmarks.

CJcool06 commented 1 year ago

Panoptic-PartFormer

The authors present the first unified framework for panoptic part segmentation (PPS) and show that jointly learning things, stuff, and parts is beneficial. The paper was difficult to follow, and their method is largely assembled from previous works; they state that their goal was to create the first baseline with a unified architecture for this task. Both their architecture and their inference procedure are convoluted and hard to understand.

CJcool06 commented 1 year ago

Deformable DETR

Great paper. They propose to alleviate DETR's issues of slow convergence and limited spatial resolution (number of input pixels) by changing the attention modules to attend only to a small set of key sampling points around a reference point. Their results show better object detection performance (especially on small objects) and faster convergence (10× fewer training epochs).

I've roughly detailed their main contributions below:

- Each query attends to only a small, fixed number of sampling points, whose locations are predicted as learned offsets from a reference point.
- The attention weight of each sampled point is predicted directly from the query via a linear projection, rather than computed with query-key dot products.

These two contributions are what form their proposed deformable attention module, sketched below.
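
A hedged, single-scale, single-head sketch of that sampling mechanism (the real module is multi-head and multi-scale; names like `n_points` and the reference-point convention in [-1, 1] are my own simplifications):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttnSketch(nn.Module):
    def __init__(self, d_model=256, n_points=4):
        super().__init__()
        self.offsets = nn.Linear(d_model, n_points * 2)  # sampling offsets, predicted from the query
        self.weights = nn.Linear(d_model, n_points)      # attention weights, predicted from the query
        self.value_proj = nn.Linear(d_model, d_model)
        self.n_points = n_points

    def forward(self, query, ref_points, value_map):
        # query: (B, Q, C); ref_points: (B, Q, 2) in [-1, 1]; value_map: (B, C, H, W)
        B, Q, C = query.shape
        value_map = self.value_proj(value_map.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        offsets = self.offsets(query).view(B, Q, self.n_points, 2)
        # Sample values at reference point + learned offsets (bilinear interpolation).
        locs = (ref_points.unsqueeze(2) + offsets).clamp(-1, 1)        # (B, Q, P, 2)
        sampled = F.grid_sample(value_map, locs, align_corners=False)  # (B, C, Q, P)
        w = self.weights(query).softmax(-1)                            # (B, Q, P)
        return (sampled * w.unsqueeze(1)).sum(-1).transpose(1, 2)      # (B, Q, C)
```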

I believe this paper lends some validity to the idea of using learned pivot points/queries in the attention module to attend to specific areas/parts of an object. These pivot points would create a part/object mask that can be used for both object detection and segmentation, possibly using only an encoder.

CJcool06 commented 1 year ago

GroupViT

A contrastive loss is used to train a 12-layer transformer encoder stack on input images and a separate transformer encoder on the paired text. Inference is done by comparing the image and text output representations in latent space: conceptually, correct image-text pairs should be close together and incorrect pairs far apart (measured via dot product).
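
A short sketch of that objective, assuming a CLIP-style symmetric cross-entropy (the temperature value and function names are illustrative, not GroupViT's exact implementation):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, tau=0.07):
    # img_emb, txt_emb: (B, D) outputs of the image and text encoders
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau  # (B, B) pairwise dot products
    targets = torch.arange(img_emb.size(0), device=img_emb.device)  # diagonal = correct pairs
    # Symmetric: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```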

Their main contribution, aside from the contrastive loss, is adding two grouping blocks inside the 12-layer image encoder. This makes the encoder hierarchical, since the number of tokens decreases after each grouping block. For example, the outputs of the first 6 encoder blocks (say, N tokens) are each assigned to one of M learned group tokens and then merged per group into M "segments", where M < N.
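
A rough sketch of that grouping step (GroupViT actually uses Gumbel-softmax hard assignment plus cross-attention and MLPs; this simplified soft-assignment version is only meant to show how N tokens shrink to M):

```python
import torch

def group_tokens(segments, groups):
    # segments: (B, N, C) encoder outputs; groups: (B, M, C) learned group tokens
    sim = torch.einsum('bmc,bnc->bmn', groups, segments)     # group-segment similarity
    assign = sim.softmax(dim=1)                              # each segment distributes over M groups
    merged = torch.einsum('bmn,bnc->bmc', assign, segments)  # (B, M, C): merge per group
    return merged / assign.sum(-1, keepdim=True).clamp(min=1e-6)  # normalise by group mass
```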

Using text for bottom-up segmentation is a neat idea, and their grouping block seems like a good mechanism that shows promising results.

However, I am not confident in the authors' results, as they seem too good to be true and their comparisons seem sketchy. The method also looks computationally expensive, since it requires two separate transformer networks, and they don't disclose parameter counts or FLOPs.

CJcool06 commented 1 year ago

Affinity from Attention

The authors note the similarity between multi-head self-attention (MHSA) maps and semantic affinity, and they apply their method to weakly-supervised semantic segmentation (WSSS).

Put simply, this work tries to create an end-to-end trainable architecture that can use attention maps for semantic segmentation. They do this by taking the Segformer MiT architecture (backbone plus MLP segmentation head) and building a training scheme that derives pseudo-labels for both segmentation and affinity maps, together with a pixel refinement method for correcting unreliably labelled pixels. They essentially have three auxiliary losses.
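
A sketch of the core affinity-from-attention idea, under my own simplifications (the paper's actual module differs in detail; the essential move is combining each attention map with its transpose to predict a symmetric pairwise affinity):

```python
import torch
import torch.nn as nn

class AffinityFromAttnSketch(nn.Module):
    def __init__(self, n_heads=8):
        super().__init__()
        # Combines each head's attention map and its transpose into one affinity score.
        self.head = nn.Linear(2 * n_heads, 1)

    def forward(self, attn):
        # attn: (B, H, N, N) multi-head self-attention maps over N patches
        feats = torch.cat([attn, attn.transpose(-1, -2)], dim=1)      # (B, 2H, N, N)
        affinity = self.head(feats.permute(0, 2, 3, 1)).squeeze(-1)   # (B, N, N)
        return torch.sigmoid(affinity)  # pairwise patch affinity in [0, 1]
```

The affinity prediction would then be supervised with the pseudo affinity labels mentioned above.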

While this work shines a light on using attention maps for segmentation, I believe there are brighter ways to implement and apply this idea.

Note: Segformer MiT is similar to ViT but uses an efficient attention variant, produces multi-scale features, and has overlapping patches.