A vanilla Transformer architecture is used: a CNN backbone produces features, which (with positional encodings added) are passed through an encoder stack. The encoder output is given as keys and values to the decoder stack, whose queries are learned positional embeddings (called object queries). Each decoder output is passed to an FFN that predicts a class and a bounding box (centre coordinates, height, width).
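A minimal sketch of that pipeline, assuming a stand-in backbone and PyTorch's built-in `nn.Transformer` (module names, shapes, and hyperparameters are my own illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class DETRSketch(nn.Module):
    """Rough DETR-style pipeline: CNN features -> encoder -> decoder with object queries -> FFN heads."""
    def __init__(self, num_classes, num_queries=100, d_model=256, nhead=8):
        super().__init__()
        # Stand-in CNN backbone (the paper uses a ResNet); downsamples and projects to d_model channels.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=4, padding=1),
        )
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers=6,
                                          num_decoder_layers=6, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)  # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for the "no object" class
        self.box_head = nn.Linear(d_model, 4)                  # (centre x, centre y, height, width)

    def forward(self, images):
        feats = self.backbone(images)                  # (B, d_model, H', W')
        src = feats.flatten(2).transpose(1, 2)         # (B, H'*W', d_model) sequence of pixel features
        # DETR adds fixed sinusoidal positional encodings to src here; omitted for brevity.
        queries = self.query_embed.weight.unsqueeze(0).expand(src.size(0), -1, -1)
        hs = self.transformer(src, queries)            # decoder output: (B, num_queries, d_model)
        return self.class_head(hs), self.box_head(hs).sigmoid()

# e.g. logits, boxes = DETRSketch(num_classes=91)(torch.randn(2, 3, 256, 256))
```

This leaves out the positional encodings and the Hungarian (bipartite) matching loss, both of which are central to making the actual method work.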
What's interesting is that the attention maps of the final encoder block, when visualised for a reference pixel, look like a segmentation mask of the object containing that pixel. I have an idea that can leverage this.
Another interesting note is that the decoder was shown to learn, as one of its tasks, to remove duplicate predictions. I have an idea about how to remove this dependency.
All in all, awesome paper. I see that lots of recent works are leveraging this architecture and achieving great results on benchmarks.
The authors present the first unified framework for part panoptic segmentation (PPS) and show that jointly learning things, stuff, and parts is beneficial. The paper was difficult to follow and their method is largely assembled from previous works; they state that their goal was to create the first baseline with a unified architecture for this task. Their architecture is weird and difficult to understand, as is their inference method.
Great paper. They propose to alleviate DETR's issues of slow convergence and limited spatial resolution (number of input pixels) by changing the attention modules to attend only to a small set of key sampling points around a reference point. Their results show that they achieve better object detection performance (especially on small objects) and faster convergence (10× fewer training epochs).
I've roughly detailed their main contributions below.
A way to learn data-dependent sparse attention, as opposed to using a fixed sparse attention pattern. This brings the efficiency gains without sacrificing global attention and is inspired by deformable convolution [Dai et al., 2017](https://arxiv.org/abs/1703.06211). Their idea is to take a reference point/query from the input image and sample N points around it, for each head in MHA. This reduces the computational complexity, as the query is not compared to all keys, only the N chosen keys. I believe this is done to speed up convergence by forcing the attention map to be sparse from the beginning, as attention maps have been shown to end up that way anyway [Child et al., 2019](https://arxiv.org/abs/1904.10509).
Multi-scale feature representation in multi-head attention is proposed to alleviate the difficulty of representing objects at vastly different scales. Their attention module can share information across multi-scale feature maps via the attention mechanism, without the help of commonly used feature pyramid networks [Lin et al., 2016](https://arxiv.org/abs/1612.03144).
These two contributions together form their proposed deformable attention module, sketched below.
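A minimal, single-scale sketch of the sampling idea, assuming bilinear sampling via `grid_sample` and per-head softmax weights; the names and shapes are my own simplification rather than the authors' multi-scale implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    """Each query attends only to K sampled points around its reference point (single scale)."""
    def __init__(self, d_model=256, n_heads=8, n_points=4):
        super().__init__()
        self.n_heads, self.n_points, self.head_dim = n_heads, n_points, d_model // n_heads
        # Offsets and weights are predicted from the query itself (data-dependent sparse attention).
        self.sampling_offsets = nn.Linear(d_model, n_heads * n_points * 2)
        self.attention_weights = nn.Linear(d_model, n_heads * n_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, ref_points, feat_map):
        # queries: (B, Nq, d_model); ref_points: (B, Nq, 2) in [0, 1] as (x, y); feat_map: (B, d_model, H, W)
        B, Nq, _ = queries.shape
        H, W = feat_map.shape[-2:]
        value = self.value_proj(feat_map.flatten(2).transpose(1, 2))          # (B, H*W, d_model)
        value = value.transpose(1, 2).reshape(B * self.n_heads, self.head_dim, H, W)

        offsets = self.sampling_offsets(queries).view(B, Nq, self.n_heads, self.n_points, 2)
        weights = self.attention_weights(queries).view(B, Nq, self.n_heads, self.n_points).softmax(-1)

        # Sampling locations = reference point + predicted offsets (normalised by the feature map size).
        scale = torch.tensor([W, H], dtype=queries.dtype, device=queries.device)
        locs = ref_points[:, :, None, None, :] + offsets / scale
        grid = (2 * locs - 1).permute(0, 2, 1, 3, 4).reshape(B * self.n_heads, Nq, self.n_points, 2)
        sampled = F.grid_sample(value, grid, align_corners=False)             # (B*heads, head_dim, Nq, K)

        # Weighted sum over the K sampled keys instead of attending to all H*W positions.
        w = weights.permute(0, 2, 1, 3).reshape(B * self.n_heads, 1, Nq, self.n_points)
        out = (sampled * w).sum(-1).reshape(B, self.n_heads * self.head_dim, Nq).transpose(1, 2)
        return self.output_proj(out)                                          # (B, Nq, d_model)
```

The actual module also aggregates sampled points across multiple feature-map scales (their second contribution), which I have left out here.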
I believe this paper lends some validity to the idea of using learned pivot points/queries in the attention module to attend to specific areas/parts of an object. These pivot points would create a part/object mask that can be used for both object detection and segmentation, possibly using only an encoder.
A contrastive loss approach is used to train a 12-layer transformer encoder stack on input images and another text encoder transformer on input text. Inference is done by comparing the respective image and text output representations in latent space. Conceptually, correct image-text pairs should be close together and incorrect pairs should be far away (measured via dot product).
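A minimal sketch of this kind of symmetric image-text contrastive objective (a generic CLIP-style formulation under my own assumptions, not necessarily the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched image-text pairs (same batch index) are pulled together."""
    image_emb = F.normalize(image_emb, dim=-1)           # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)             # (B, D)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) pairwise dot-product similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image -> text and text -> image.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# e.g. loss = contrastive_loss(image_encoder_output, text_encoder_output)  # both (B, D) pooled embeddings
```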
Their main contribution, aside from the contrastive loss, is the addition of two grouping blocks within the 12-layer image encoder. This makes the image encoder hierarchical, as the number of tokens decreases after each grouping block. For example,
The outputs of the first 6 encoder blocks (say, N tokens) are each assigned to one of M groups using learnable group tokens, and then merged by group into M "segments", where M < N.
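A rough sketch of how I read the grouping step, using a plain soft assignment for simplicity (the learnable `group_tokens`, the projection, and the weighted-average merge are my own assumptions about the mechanism):

```python
import torch
import torch.nn as nn

class GroupingBlockSketch(nn.Module):
    """Merges N image tokens into M segment tokens via similarity to learnable group tokens."""
    def __init__(self, d_model=256, num_groups=64):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(num_groups, d_model))
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, image_tokens):
        # image_tokens: (B, N, d_model) -> (B, M, d_model), with M = num_groups < N
        sim = self.proj(image_tokens) @ self.group_tokens.t()   # (B, N, M) token-to-group affinity
        assign = sim.softmax(dim=-1)                            # soft assignment of each token to a group
        # Weighted average of the tokens assigned to each group ("merge by group into M segments").
        merged = assign.transpose(1, 2) @ image_tokens          # (B, M, d_model)
        merged = merged / assign.sum(dim=1).unsqueeze(-1).clamp_min(1e-6)
        return merged

# e.g. segments = GroupingBlockSketch()(torch.randn(2, 196, 256))   # 196 tokens -> 64 segments
```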
Using text for bottom-up segmentation is a neat idea, and their grouping block also seems promising given the results they show.
However, I am not confident in the authors' results, as they seem too good to be true and their comparisons seem sketchy. The method also appears computationally expensive, since it requires two separate transformer networks, and they do not disclose parameter counts or FLOPs.
The authors note the similarity between multi-head self-attention (MHSA) maps and semantic affinity, and apply their method to weakly-supervised semantic segmentation (WSSS).
Put simply, this work tries to create an end-to-end trainable architecture that can use attention maps for semantic segmentation. They do this by taking the Segformer MiT architecture (backbone and MLP segmentation head) and building a training scheme that introduces pseudo-labels for the segmentation and affinity maps, plus a pixel refinement method for cleaning up incorrectly labelled pixels. They essentially have three auxiliary losses.
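As a toy illustration of the attention-as-affinity idea (my own simplification; the paper's affinity loss and pixel refinement are more involved than this):

```python
import torch

def affinity_from_attention(attn, threshold=0.5):
    """Derive a symmetric token-pair affinity estimate from multi-head self-attention maps.

    attn: (B, heads, N, N) attention weights over N = H*W patch tokens.
    Returns a (B, N, N) binary map: 1 where two tokens likely belong to the same class.
    """
    # Average over heads and symmetrise, since affinity should be a symmetric relation.
    a = attn.mean(dim=1)
    a = 0.5 * (a + a.transpose(1, 2))
    # Normalise each row to [0, 1] and threshold into a pseudo affinity label.
    a = (a - a.amin(dim=-1, keepdim=True)) / (a.amax(dim=-1, keepdim=True) - a.amin(dim=-1, keepdim=True) + 1e-6)
    return (a > threshold).float()

# e.g. pseudo_affinity = affinity_from_attention(torch.rand(2, 8, 196, 196))
```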
While this work shines a light on using attention maps for segmentation, I believe there are brighter ways to implement and apply this idea.
Note: Segformer MiT is similar to ViT but uses an efficient attention, produces multi-scale features, and uses overlapping patch embeddings.
Jack would like me to read and summarise these papers:
Additionally, I think the following papers would also be beneficial: