Sik-Ho Tang | Review -- DeepCluster: Deep Clustering for Unsupervised Learning of Visual Features.

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review — DeepCluster: Deep Clustering for Unsupervised Learning of Visual Features.

NorbertZheng commented 1 year ago

Overview

DeepCluster, K-Mean Clustering to Generate Pseudo-Labels, a Pretext Task for Self-Supervised Learning.

Illustration of the Proposed DeepCluster.

In this story, Deep Clustering for Unsupervised Learning of Visual Features, DeepCluster, by Facebook AI Research, is reviewed. In this paper:

DeepCluster, a clustering method is proposed that jointly learns the parameters of a neural network and the cluster assignments of the resulting features.
DeepCluster iteratively
- groups the features with a standard clustering algorithm, k-means,
- and uses the subsequent assignments as supervision to update the weights of the network.

This is a paper in 2018 ECCV with over 900 citations.

NorbertZheng commented 1 year ago

K-Mean is applied to learned features, thus generating pseudo-label, EM-learning???

NorbertZheng commented 1 year ago

Notations for Supervised Learning

Before talking about DeepCluster, let’s define some notations using supervised learning.

Given a training set $X=\{x{1}, x{2}, ..., x_{N}\}$ of $N$ images, we want to find a parameter $\theta^{*}$ such that the mapping $f$ produces good general-purpose features.

These parameters are traditionally learned with supervision, i.e. each image $x{n}$ is associated with a label $y{n}$ in $\{0, 1\}^{k}$. This label represents the image’s membership to one of $k$ possible predefined classes.

A parametrized classifier $g{W}$ predicts the correct labels on top of the features $f(x{n})$.

Therefore, the loss function is (Eq. (1)):

where $l$ is the multinomial logistic loss.

This cost function is minimized using mini-batch stochastic gradient descent and backpropagation to compute the gradient.

NorbertZheng commented 1 year ago

DeepCluster as Pretext Task in Self-Supervised Learning

Top: k-Means Clustering on Vectors Produced by CNN; Bottom: Using the clustering results as psuedo labels for backpropagation.

DeepCluster Procedures

The idea of this work is to exploit this weak signal to bootstrap the discriminative power of a convnet.

We cluster the output of the convnet and use the subsequent cluster assignments as “pseudo-labels” to optimize Eq. (1). This deep clustering (DeepCluster) approach iteratively learns the features and groups them.

A standard clustering algorithm, $k$-means, is used.

$k$-means takes a set of vectors as input, in our case the features $f(x_{n})$ produced by the convnet, and clusters them into $k$ distinct groups based on a geometric criterion.

More precisely, it jointly learns a $d\times k$ centroid matrix $C$ and the cluster assignments $y_{n}$ of each image $n$ by solving the following problem (Eq. (2)):

Overall, DeepCluster alternates between clustering the features to produce pseudo-labels using Eq. (2) and updating the parameters of the convnet by predicting these pseudo-labels using Eq. (1).

NorbertZheng commented 1 year ago

Avoiding Trivial Solutions

Empty Cluster

An optimal decision boundary is to assign all of the inputs to a single cluster. This issue is caused by the absence of mechanisms to prevent from empty clusters.

More precisely, when a cluster becomes empty, a non-empty cluster is randomly selected and its centroid is used with a small random perturbation as the new centroid for the empty cluster. The points are then reassigned belonging to the non-empty cluster to the two resulting clusters.

Trivial Parametrization

If the vast majority of images is assigned to a few clusters, the parameters $\theta$ will exclusively discriminate between them.

A strategy to circumvent this issue is to sample images based on a uniform distribution over the classes, or pseudo-labels.

NorbertZheng commented 1 year ago

DeepCluster Analysis

Normalized Mutual Information (NMI)

(a): Evolution of the clustering quality along training epochs; (b): evolution of cluster reassignments at each clustering step; (c): validation mAP classification performance for various choices of k.

Normalized Mutual Information (NMI), is used to measure the performance: where $I$ denotes the mutual information and $H$ the entropy.

If the two assignments A and B are independent, the NMI is equal to 0. If one of them is deterministically predictable from the other, the NMI is equal to 1.

(a): The dependence between the clusters and the labels increases over time, showing that the learnt features progressively capture information related to object classes.
(b): The NMI is increasing, meaning that there are less and less reassignments and the clusters are stabilizing over time.
(c): The best performance is obtained with k= 10,000. Given that ImageNet has 1000 classes. Apparently some amount of over-segmentation is beneficial.

NorbertZheng commented 1 year ago

Visualizations

Filter visualization and top 9 activated images from a subset of 1 million images from YFCC100M.

As expected, deeper layers in the network seem to capture larger textural structures.

Top 9 activated images from a random subset of 10 millions images from YFCC100M for target filters in the last convolutional layer.

The filters on the top row contain information about structures that highly correlate with object classes. The filters on the bottom row seem to trigger on style, like drawings or abstract shapes.

NorbertZheng commented 1 year ago

DeepCluster Performance

Linear Classification on Activations on ImageNet & Places

Linear classification on ImageNet and Places using activations from the convolutional layers of an AlexNet as features.

ImageNet

Model:

A linear classifier is trained on top of different frozen convolutional layers.

Results:

On ImageNet, DeepCluster outperforms the state of the art from conv2 to conv5 layers by 1−6%. The largest improvement is observed in the conv3 layer.

Finally, the difference of performance between DeepCluster and a supervised AlexNet grows significantly on higher layers: at layers conv2-conv3 the difference is only around 4%, but this difference rises to 12.3% at conv5,

marking where the AlexNet probably stores most of the class level information.

If a MLP is trained on the last layer, DeepCluster outperforms the state of the art by 8%.

Places

DeepCluster yields conv3-4 features that are comparable to those trained with ImageNet labels.

This suggests that when the target task is sufficiently far from the domain covered by ImageNet, labels are less important.

NorbertZheng commented 1 year ago

If the target domain is exactly the same with the source domain, supervise learning is the best method. E.g. the same subject watching different pictures?
If the target domain is different from the source domain, there is little gap between supervise learning and un-supervise learning. E.g. different subjects watching the same pictures or different pictures?

NorbertZheng commented 1 year ago

Pascal VOC

Comparison of the proposed approach to state-of-the-art unsupervised feature learning on classification, detection and segmentation on Pascal VOC.

DeepCluster outperforms previous unsupervised methods, such as Context Prediction [13], Context Encoders [46], Colorization [71], Split-Brain Auto [72], Jigsaw Puzzles [42], on all three tasks, in every setting.

The improvement with fine-tuning over the state of the art is the largest on semantic segmentation (7.5%).
On detection, DeepCluster performs only slightly better than previously published methods. Interestingly, a fine-tuned random network performs comparatively to many unsupervised methods, but performs poorly if only fc6–8 are learned.

NorbertZheng commented 1 year ago

YFCC100M

Impact of the training set on the performance of DeepCluster measured on the Pascal VOC transfer tasks.

In YFCC100M, object classes are severly unbalanced, leading to a data distribution less favorable to DeepCluster.
This experiment validates that DeepCluster is robust to a change of image distribution, leading to state-of-the-art general-purpose visual features even if this distribution is not favorable to its design.

NorbertZheng commented 1 year ago

AlexNet vs VGGNet

Pascal VOC 2007 object detection with AlexNet and VGG16.

In the previous experiment, AlexNet is used. Here a deeper network VGGNet is tried.

Training the VGG-16 with DeepCluster gives a performance above the state of the art, bringing us to only 1.4% below the supervised topline.

NorbertZheng commented 1 year ago

Image Retrieval

mAP on instance-level image retrieval on Oxford and Paris dataset with a VGG16.

The above table suggests that image retrieval is a task where the pre-training is essential and studying it as a down-stream task could give further insights about the quality of the features produced by unsupervised approaches.

NorbertZheng commented 1 year ago

One of the major issues is that k-mean clustering takes quite plenty of time.

NorbertZheng commented 1 year ago

Reference

[2018 ECCV] [DeepCluster] Deep Clustering for Unsupervised Learning of Visual Features.

NorbertZheng / read-papers