ChrisFuscoMasters / TransformerLib

Facilitates my research

Ideas and Papers #20

Open CJcool06 opened 1 year ago

CJcool06 commented 1 year ago

Unread papers that might give a new idea:

Already read and noted papers that could be useful:

CJcool06 commented 1 year ago

Segment Anything

The model architecture contains a large MAE-pretrained ViT image encoder, a prompt encoder (CLIP), and a lightweight mask decoder (two transformer blocks).

Their model is used in a 3-stage data engine to produce 1 billion masks on 11 million images.

The significance of this paper lies in the modular model design, the prompt encoder, and the huge publicly released dataset produced by the data engine. They present their pre-trained model as a 'foundation' model that can be fine-tuned for downstream tasks. Impressive zero-shot results were shown for edge detection, semantic segmentation, and instance segmentation.
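As a mental model, here's a minimal sketch of that modular design, assuming abstract encoder/decoder components; the names are illustrative, not the actual SAM codebase:

```python
import torch.nn as nn

class PromptableSegmenter(nn.Module):
    """Hypothetical skeleton of the SAM-style modular design."""

    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        super().__init__()
        self.image_encoder = image_encoder    # heavy MAE-pretrained ViT
        self.prompt_encoder = prompt_encoder  # embeds points/boxes/text
        self.mask_decoder = mask_decoder      # lightweight, ~2 blocks

    def forward(self, image, prompts):
        # The expensive image embedding is computed once and can be
        # reused across many cheap prompt + decoder passes.
        image_embedding = self.image_encoder(image)
        prompt_embedding = self.prompt_encoder(prompts)
        return self.mask_decoder(image_embedding, prompt_embedding)
```

That split is what makes the model interactive: re-prompting only re-runs the prompt encoder and the tiny decoder.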

Very impressive paper overall. Perhaps the dataset could be used in some manner down the line.

Notes:

CJcool06 commented 1 year ago

Beyond mAP

The paper highlights how mAP (for box predictions) can be gamed by models producing 'hedge' predictions: low-confidence predictions that differ only marginally from earlier high-confidence ones and sit in the high-recall regime. These hedge predictions slightly increase the area under the precision-recall curve, and thus the AP, so largely incorrect low-confidence predictions end up being rewarded by mAP.

The authors introduce a way of quantifying this error by constructing a graph over all detections, with edges connecting detections whose IoU is above a threshold. For a pair of nodes, their algorithm finds the weakest edge on each path between them, then takes the maximum of these weakest edges over all possible paths; this max-min value is defined as the connectivity of the two nodes. The connectivities are summed and averaged over all detections to give the final score.
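A minimal sketch of that computation, assuming pairwise IoUs are precomputed. It uses the fact that the max-min (widest-path) value between two nodes equals the bottleneck edge on their maximum-spanning-tree path; averaging over node pairs here is my simplification, not the authors' exact normalisation:

```python
import networkx as nx

def mean_connectivity(pairwise_iou, iou_thresh=0.5):
    """Average max-min path connectivity over all detection pairs."""
    n = len(pairwise_iou)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_iou[i][j] > iou_thresh:
                g.add_edge(i, j, weight=pairwise_iou[i][j])

    # The widest path between two nodes runs through the maximum
    # spanning forest; its weakest edge is their connectivity.
    msf = nx.maximum_spanning_tree(g)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if nx.has_path(msf, i, j):
                path = nx.shortest_path(msf, i, j)  # unique path in a tree
                total += min(msf[u][v]["weight"]
                             for u, v in zip(path, path[1:]))
            pairs += 1
    return total / max(pairs, 1)
```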

This by itself is not a good measure, but it captures spatial hedging effectively, since spatially perturbed predictions form densely connected graphs.

In a strange turn of events, the authors then present a new way of gaming instance segmentation mAP by re-ranking each mask instance based on its 'agreement' with a semantic segmentation output.
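A minimal sketch of that re-ranking trick, assuming binary instance masks and a per-pixel semantic prediction; the 'agreement' score here (fraction of instance pixels the semantic map assigns to the same class) is my illustrative stand-in:

```python
import numpy as np

def rerank_by_semantic_agreement(inst_masks, inst_classes, semantic_map):
    """Re-score each instance by its agreement with a semantic output."""
    # inst_masks: (N, H, W) bool, inst_classes: (N,), semantic_map: (H, W)
    scores = []
    for mask, cls in zip(inst_masks, inst_classes):
        agreement = (semantic_map[mask] == cls).mean() if mask.any() else 0.0
        scores.append(agreement)
    return np.asarray(scores)  # sort detections by these scores to game AP
```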

Notes:

CJcool06 commented 1 year ago

Masked-Attention Mask Transformer

Their new work (Mask2Former) is an improvement over their previous MaskFormer.

The improvements are:

- Masked attention in the Transformer decoder, restricting each query's cross-attention to the foreground region of the mask predicted by the previous layer (sketched below).
- Multi-scale high-resolution features fed to successive decoder layers.
- Optimisation tweaks and faster training by computing the mask loss on a few sampled points rather than on full masks.
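A minimal sketch of the masked-attention idea from the title, assuming per-query mask logits from the previous decoder layer; shapes and names are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(queries, keys, values, prev_mask_logits):
    """Cross-attention restricted to each query's predicted foreground."""
    # queries: (Q, d), keys/values: (HW, d), prev_mask_logits: (Q, HW)
    attn = queries @ keys.T / keys.shape[-1] ** 0.5              # (Q, HW)
    # Locations the previous layer considered background are masked out
    # before the softmax (a fully-masked row would need a fallback,
    # omitted here for brevity).
    attn = attn.masked_fill(prev_mask_logits.sigmoid() < 0.5, float("-inf"))
    return F.softmax(attn, dim=-1) @ values                      # (Q, d)
```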

They show SOTA results for each segmentation task (semantic, instance, panoptic). They used the same architecture, although retrained, for each task.

The paper is well written and their ablation studies are good quality.

CJcool06 commented 1 year ago

Evaluating Large-Vocabulary Object Detectors

The authors show that the standard metric for evaluating object detection (Average Precision) can be gamed in large-vocabulary, high-instance-count conditions. A simple re-ranking policy applied to the detections can increase AP substantially.

The cause of this is the detections-per-image limit. For example, with a maximum of 50 detections per image, a model with a vocabulary of 100 classes is forced to learn improper cross-category calibration, deciding which classes' detections to keep, in order to achieve the highest possible AP.

The fix is more difficult than simply increasing the detections-per-image limit, as we cannot evaluate an infinite number of detections in practice. The authors propose two new metrics:

AP(fixed): Remove the detections-per-image limit and add a detections-per-class limit for the entire dataset, i.e. 'car' can be predicted a maximum of 1,000 times during evaluation. However, this metric is invariant to cross-category calibration.

AP(pool): Explicitly evaluates detections across all classes together. Detections are pooled into a single Precision-Recall curve, and the AP is computed on this curve.
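A minimal sketch of the pooled computation, assuming detections from every class have already been matched to ground truth; names are illustrative, not the authors' code:

```python
import numpy as np

def ap_pool(scores, is_tp, num_gt):
    """AP over a single PR curve pooled across all classes."""
    # scores, is_tp: one entry per detection, all classes together
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / num_gt
    # Area under the single pooled precision-recall curve.
    recall = np.concatenate([[0.0], recall])
    return float(np.sum((recall[1:] - recall[:-1]) * precision))
```

Because every class competes on the same curve, a detector with poor cross-category calibration is penalised directly.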

Good paper.

CJcool06 commented 1 year ago

Probabilistic Two-Stage Detection

A probabilistic two-stage detector is faster and more accurate than both its one- and two-stage precursors.

One-stage detectors jointly infer the location and class likelihood in a probabilistically sound framework. Two-stage detectors first find potential objects and their locations, then classify these potential objects in the second stage.

The maths showing their probabilistic interpretation of two-stage detection is simple and elegant. It might be worth thinking about probabilistic two-stage detectors in my project.
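A minimal sketch of that factorisation, assuming a class-agnostic first-stage objectness score and conditional second-stage class scores; names are illustrative, not the authors' code:

```python
import torch

def two_stage_score(objectness_logits, class_logits):
    """Final score as P(class, object) = P(class | object) * P(object)."""
    p_object = torch.sigmoid(objectness_logits)                  # (N,)
    p_class_given_object = torch.softmax(class_logits, dim=-1)   # (N, C)
    return p_class_given_object * p_object.unsqueeze(-1)         # (N, C)
```

Making the first stage probabilistically meaningful lets it be strict, producing fewer but higher-quality proposals, which is where the speed comes from.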

CJcool06 commented 1 year ago

Pointly-Supervised Instance Segmentation

They propose a simple point annotation scheme to collect weak supervision labels for instance segmentation. In addition to bounding boxes, a set of points is sampled uniformly at random inside each bounding box and subsequently labelled as foreground or background.
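A minimal sketch of training against such point labels, assuming normalised point coordinates and per-instance mask logits; names are illustrative, not the authors' code:

```python
import torch.nn.functional as F

def point_supervised_loss(mask_logits, point_coords, point_labels):
    """Binary cross-entropy on the predicted mask at labelled points only."""
    # mask_logits: (N, 1, H, W), point_coords: (N, P, 2) as (x, y) in [0, 1],
    # point_labels: (N, P) with 1 = foreground, 0 = background
    grid = point_coords * 2 - 1                 # grid_sample expects [-1, 1]
    sampled = F.grid_sample(mask_logits, grid.unsqueeze(2),
                            align_corners=False)
    sampled = sampled.squeeze(1).squeeze(-1)    # (N, P) logits at the points
    return F.binary_cross_entropy_with_logits(sampled, point_labels.float())
```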

Comparing baselines trained on mask labels vs. point labels shows less than a 5% reduction in AP on validation sets. These results were reproduced across four popular segmentation datasets (COCO, LVIS, PASCAL VOC, and Cityscapes).

Point annotations are ~5 times faster to label than polygon-based mask annotations on COCO.

It's apparent that this work was a pilot for FAIR's subsequent paper Segment Anything, where a tweaked version was used in their data engine and model training.

CJcool06 commented 1 year ago

Masked Autoencoders Are Scalable Vision Learners

Another impactful paper by Kaiming He that is sure to shape the computer vision field going forward.

The paper devises a meaningful and simple self-supervised task: randomly mask out a high proportion of the input image, then force the network to reconstruct it.

To apply this self-supervisory task, they develop an asymmetric encoder-decoder architecture. The encoder is a vanilla ViT that only receives the visible, unmasked patches as input. This allows training very large encoders with only a fraction of the compute and memory. The decoder is a set of transformer blocks.
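A minimal sketch of the random masking that makes the encoder cheap, assuming patch embeddings are already computed; the 75% ratio follows the paper, but the code and names are illustrative:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patches; only these go to the encoder."""
    # patches: (N, L, d) sequence of patch embeddings
    n, l, d = patches.shape
    num_keep = int(l * (1 - mask_ratio))
    noise = torch.rand(n, l)                       # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]  # indices of visible patches
    visible = torch.gather(patches, 1,
                           keep_idx.unsqueeze(-1).expand(-1, -1, d))
    # The decoder later reconstructs the full sequence from the encoded
    # visible patches plus learned mask tokens at the masked positions.
    return visible, keep_idx
```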

My notes for this paper would not do it justice, so I refer back to the paper for a proper understanding.