NorbertZheng / read-papers

My paper reading notes.
MIT License

Sik-Ho Tang | Review -- DeepCluster: Deep Clustering for Unsupervised Learning of Visual Features. #132

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review — DeepCluster: Deep Clustering for Unsupervised Learning of Visual Features.

NorbertZheng commented 1 year ago

Overview

DeepCluster, k-Means Clustering to Generate Pseudo-Labels, a Pretext Task for Self-Supervised Learning.

image Illustration of the Proposed DeepCluster.

In this story, Deep Clustering for Unsupervised Learning of Visual Features (DeepCluster), by Facebook AI Research, is reviewed.

This is a paper in 2018 ECCV with over 900 citations.

NorbertZheng commented 1 year ago

k-means is applied to the learned features, thus generating pseudo-labels. EM-style learning???

NorbertZheng commented 1 year ago

Notations for Supervised Learning

Before talking about DeepCluster, let’s define some notations using supervised learning.

Given a training set $X=\{x_{1}, x_{2}, \dots, x_{N}\}$ of $N$ images, we want to find a parameter $\theta^{*}$ such that the mapping $f_{\theta^{*}}$ produces good general-purpose features.

These parameters are traditionally learned with supervision, i.e. each image $x_{n}$ is associated with a label $y_{n}$ in $\{0, 1\}^{k}$. This label represents the image’s membership to one of $k$ possible predefined classes.

A parametrized classifier $g_{W}$ predicts the correct labels on top of the features $f_{\theta}(x_{n})$.

Therefore, the loss function is (Eq. (1)):

$$\min_{\theta, W} \frac{1}{N} \sum_{n=1}^{N} \ell\left(g_{W}(f_{\theta}(x_{n})), y_{n}\right),$$

where $\ell$ is the multinomial logistic loss.

This cost function is minimized using mini-batch stochastic gradient descent and backpropagation to compute the gradient.
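
For concreteness, here is a minimal PyTorch sketch of one such mini-batch SGD step on Eq. (1); the tiny `f_theta`/`g_W` modules are stand-ins for illustration, not the paper's AlexNet:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: f_theta is the convnet feature extractor,
# g_W is the classifier head on top of it.
f_theta = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
g_W = nn.Linear(64, 1000)  # k = 1000 predefined classes

criterion = nn.CrossEntropyLoss()  # the multinomial logistic loss l
optimizer = torch.optim.SGD(
    list(f_theta.parameters()) + list(g_W.parameters()),
    lr=0.05, momentum=0.9,
)

def train_step(x, y):
    """One mini-batch SGD step minimizing Eq. (1)."""
    optimizer.zero_grad()
    loss = criterion(g_W(f_theta(x)), y)  # l(g_W(f_theta(x_n)), y_n)
    loss.backward()
    optimizer.step()
    return loss.item()
```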

NorbertZheng commented 1 year ago

DeepCluster as Pretext Task in Self-Supervised Learning

image Top: k-Means Clustering on Vectors Produced by CNN; Bottom: Using the clustering results as pseudo-labels for backpropagation.

DeepCluster Procedures

Even a randomly initialized convnet already produces weakly discriminative features, thanks to the strong prior its convolutional structure puts on the input signal. The idea of this work is to exploit this weak signal to bootstrap the discriminative power of a convnet.

We cluster the output of the convnet and use the subsequent cluster assignments as “pseudo-labels” to optimize Eq. (1). This deep clustering (DeepCluster) approach iteratively learns the features and groups them.

A standard clustering algorithm, $k$-means, is used.

$k$-means takes a set of vectors as input, in our case the features $f(x_{n})$ produced by the convnet, and clusters them into $k$ distinct groups based on a geometric criterion.

More precisely, it jointly learns a $d\times k$ centroid matrix $C$ and the cluster assignments $y_{n}$ of each image $n$ by solving the following problem (Eq. (2)):

$$\min_{C \in \mathbb{R}^{d \times k}} \frac{1}{N} \sum_{n=1}^{N} \min_{y_{n} \in \{0,1\}^{k}} \left\| f_{\theta}(x_{n}) - C y_{n} \right\|_{2}^{2} \quad \text{such that} \quad y_{n}^{\top} 1_{k} = 1.$$

Overall, DeepCluster alternates between clustering the features to produce pseudo-labels using Eq. (2) and updating the parameters of the convnet by predicting these pseudo-labels using Eq. (1).
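
A minimal sketch of this alternation (assumptions: `features_of` and `train_one_epoch` are hypothetical helpers; the paper actually runs faiss k-means on PCA-reduced, whitened, L2-normalized features, while plain scikit-learn k-means stands in here):

```python
from sklearn.cluster import KMeans

def deepcluster_epoch(model, images, k, features_of, train_one_epoch):
    """One DeepCluster iteration: cluster, then predict the clusters."""
    # Step 1: solve Eq. (2) on the current features to get pseudo-labels.
    feats = features_of(model, images)           # shape (N, d)
    pseudo_labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
    # Step 2: update theta and W by minimizing Eq. (1) on the pseudo-labels.
    # Cluster indices are arbitrary across epochs, so the classifier head
    # is typically re-initialized before this step.
    train_one_epoch(model, images, pseudo_labels)
    return pseudo_labels
```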

NorbertZheng commented 1 year ago

Avoiding Trivial Solutions

Empty Cluster

An optimal decision boundary is to assign all of the inputs to a single cluster. This issue is caused by the absence of mechanisms to prevent empty clusters.

More precisely, when a cluster becomes empty, a non-empty cluster is randomly selected and its centroid, with a small random perturbation, is used as the new centroid for the empty cluster. The points belonging to the non-empty cluster are then reassigned to the two resulting clusters.
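
A NumPy sketch of this reassignment trick, assuming `X` holds the $N \times d$ feature matrix (illustrative only, not the paper's faiss-based implementation):

```python
import numpy as np

def fix_empty_clusters(X, centroids, assignments, rng=None):
    """Give each empty cluster a slightly perturbed copy of a randomly
    chosen non-empty centroid, then split that cluster's points between
    the two resulting centroids."""
    rng = rng or np.random.default_rng()
    k = centroids.shape[0]
    for j in range(k):
        if np.any(assignments == j):
            continue                       # cluster j is non-empty
        nonempty = np.flatnonzero(np.bincount(assignments, minlength=k) > 0)
        donor = int(rng.choice(nonempty))  # a random non-empty cluster
        centroids[j] = centroids[donor] + 1e-4 * rng.standard_normal(centroids.shape[1])
        pts = np.flatnonzero(assignments == donor)
        d_donor = np.linalg.norm(X[pts] - centroids[donor], axis=1)
        d_new = np.linalg.norm(X[pts] - centroids[j], axis=1)
        assignments[pts[d_new < d_donor]] = j
    return centroids, assignments
```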

Trivial Parametrization

If the vast majority of images are assigned to a few clusters, the parameters $\theta$ will exclusively discriminate between them.

A strategy to circumvent this issue is to sample images based on a uniform distribution over the classes, or pseudo-labels. Equivalently, the contribution of an input to the loss is weighted by the inverse of the size of its assigned cluster, as sketched below.
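
One possible PyTorch realization of this sampling (a sketch; the returned sampler would be passed as `sampler=` to a `DataLoader`):

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def uniform_cluster_sampler(pseudo_labels, num_samples):
    """Weight each image by the inverse of its cluster's size, so every
    pseudo-label is drawn (roughly) uniformly often."""
    counts = np.bincount(pseudo_labels)
    weights = 1.0 / counts[pseudo_labels]
    return WeightedRandomSampler(
        torch.as_tensor(weights, dtype=torch.double),
        num_samples=num_samples, replacement=True,
    )
```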

NorbertZheng commented 1 year ago

DeepCluster Analysis

Normalized Mutual Information (NMI)

image (a): Evolution of the clustering quality along training epochs; (b): evolution of cluster reassignments at each clustering step; (c): validation mAP classification performance for various choices of k.

Normalized Mutual Information (NMI) is used to measure the performance:

$$\text{NMI}(A; B) = \frac{I(A; B)}{\sqrt{H(A)\,H(B)}},$$

where $I$ denotes the mutual information and $H$ the entropy.

If the two assignments A and B are independent, the NMI is equal to 0. If one of them is deterministically predictable from the other, the NMI is equal to 1.
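
For instance, the NMI between assignments from two consecutive epochs can be computed with scikit-learn; `average_method="geometric"` matches the $\sqrt{H(A)H(B)}$ normalization above:

```python
from sklearn.metrics import normalized_mutual_info_score

# Toy assignments over six images (hypothetical): B is A with permuted
# cluster indices, i.e. deterministically predictable from A.
A = [0, 0, 1, 1, 2, 2]   # assignments at epoch t-1
B = [1, 1, 0, 0, 2, 2]   # assignments at epoch t
print(normalized_mutual_info_score(A, B, average_method="geometric"))  # 1.0
```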

NorbertZheng commented 1 year ago

Visualizations

image Filter visualization and top 9 activated images from a subset of 1 million images from YFCC100M.

As expected, deeper layers in the network seem to capture larger textural structures.

image Top 9 activated images from a random subset of 10 million images from YFCC100M for target filters in the last convolutional layer.

The filters on the top row contain information about structures that highly correlate with object classes. The filters on the bottom row seem to trigger on style, like drawings or abstract shapes.

NorbertZheng commented 1 year ago

DeepCluster Performance

Linear Classification on Activations on ImageNet & Places

image Linear classification on ImageNet and Places using activations from the convolutional layers of an AlexNet as features.
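
As a rough sketch of this linear-probe protocol (the tiny `backbone` below and the class count are assumptions for illustration, not the paper's AlexNet):

```python
import torch
import torch.nn as nn

# Freeze the (pre-trained) convnet and train only a linear classifier
# on its frozen activations.
backbone = nn.Sequential(  # stand-in for a pretrained convnet trunk
    nn.Conv2d(3, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
feat_dim = 256

for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

probe = nn.Linear(feat_dim, 1000)  # 1000 ImageNet classes (205 for Places)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def probe_step(x, y):
    """One training step of the linear probe on frozen features."""
    with torch.no_grad():
        feats = backbone(x)  # frozen activations from the chosen conv layer
    loss = criterion(probe(feats), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```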

ImageNet


Finally, the performance gap between DeepCluster and a supervised AlexNet grows significantly at higher layers: at layers conv2-conv3 the difference is only around 4%, but it rises to 12.3% at conv5.

If an MLP is trained on the last layer, DeepCluster outperforms the state of the art by 8%.

Places

DeepCluster yields conv3-4 features that are comparable to those trained with ImageNet labels.

NorbertZheng commented 1 year ago

Pascal VOC

image Comparison of the proposed approach to state-of-the-art unsupervised feature learning on classification, detection and segmentation on Pascal VOC.

DeepCluster outperforms previous unsupervised methods, such as Context Prediction [13], Context Encoders [46], Colorization [71], Split-Brain Auto [72], Jigsaw Puzzles [42], on all three tasks, in every setting.

NorbertZheng commented 1 year ago

YFCC100M

image Impact of the training set on the performance of DeepCluster measured on the Pascal VOC transfer tasks.

NorbertZheng commented 1 year ago

AlexNet vs VGGNet

image Pascal VOC 2007 object detection with AlexNet and VGG16.

In the previous experiments, AlexNet was used. Here, a deeper network, VGG-16, is tried.

Training the VGG-16 with DeepCluster gives a performance above the state of the art, bringing us to only 1.4% below the supervised topline.

NorbertZheng commented 1 year ago

Image Retrieval

image mAP on instance-level image retrieval on the Oxford and Paris datasets with a VGG16.

The above table suggests that image retrieval is a task where the pre-training is essential and studying it as a down-stream task could give further insights about the quality of the features produced by unsupervised approaches.

NorbertZheng commented 1 year ago

One of the major issues is that the k-means clustering step takes a considerable amount of time.

NorbertZheng commented 1 year ago

Reference