We all know that researchers have been trying to develop methods that work with partially labeled data for years. Quite a few semi-supervised learning techniques handle partially labeled data reasonably well, but most of them still struggle significantly in the deep learning setting. In this blog post, we are going to discuss a strategy that doesn't require any labels at all, called Contrastive Learning. So, without further ado, let's dive deep into the concept of contrastive learning.
Note: 80% of the time spent in a supervised learning ML project is invested in acquiring and cleaning the data for model training.
Contrastive Learning is a technique generally used in vision tasks that lack labeled data. By contrasting samples against each other, it learns general features of the data without relying on labels.
As the name suggests, samples are contrasted against each other: those belonging to the same distribution or class are pulled together in the embedding space, while those belonging to different distributions are pushed apart.
Therefore, contrastive learning is generally considered to be a form of self-supervised learning, because it does not require labeled data from external sources in order to train the model to predict the difference or relationship between two input items. It is often used for representation learning, where the goal is to learn useful and meaningful representations of the input data.
Image from Contrastive Self-Supervised Learning | Ankesh Anand.
In contrastive learning, the model is presented with pairs of items and is trained to predict whether the two items are related or not. For example, the model might be presented with pairs of images and asked to predict whether the images are of the same object or not. The model is then trained to minimize the error in its predictions by adjusting its internal representations of the input data.
In particular, contrastive learning can be used to learn features that are invariant to certain transformations, such as translation or rotation, which are important for recognizing objects in natural images.
Basically, contrastive learning tries to put similar things into the same basket and to keep dissimilar things out of it. This is very similar to how humans understand the world: we don't need to be shown every car in the world to identify a new car. We build up a set of features associated with cars in our mind, and anything that exhibits similar features is categorized as a car.
Positive and negative sample.
The basic principle behind contrastive learning is: pick an anchor sample, a positive sample that is similar to the anchor, and one or more negative samples that are dissimilar to it, then train the model to pull the anchor and the positive together while pushing the anchor and the negatives apart.
But how do we actually push and pull different samples? In this method, the positive sample is typically produced by applying a transformation to the anchor itself, and the negatives are simply other samples from the dataset.
For example, if we select an image of a human as the anchor, we can jitter the image or convert it to grayscale to use as the positive sample. The negative sample can be any other image in the dataset.
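As a rough sketch of this idea (the helper names, augmentation choices, and parameter values here are illustrative assumptions, not a fixed recipe), a positive view can be produced by jittering and grayscaling the anchor, while a negative is just another image drawn from the dataset:

```python
import tensorflow as tf

def make_positive(anchor):
    # Illustrative positive view: color jitter plus grayscale conversion,
    # applied to the anchor image (pixel values assumed to be in [0, 1]).
    view = tf.image.random_brightness(anchor, max_delta=0.2)
    view = tf.image.random_contrast(view, lower=0.8, upper=1.2)
    view = tf.image.rgb_to_grayscale(view)    # drop color information
    view = tf.image.grayscale_to_rgb(view)    # back to 3 channels
    return tf.clip_by_value(view, 0.0, 1.0)

def sample_triplet(images):
    # Pick an anchor, build its positive view, and use a random other image
    # from the same batch as the negative sample.
    idx = tf.random.uniform([], 0, tf.shape(images)[0], dtype=tf.int32)
    neg_idx = tf.random.uniform([], 0, tf.shape(images)[0], dtype=tf.int32)
    anchor = images[idx]
    return anchor, make_positive(anchor), images[neg_idx]
```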
The framework of instance discrimination-based contrastive learning.
Different types of image transformation:
Another approach breaks a single image into multiple patches of a fixed dimension (overlapping patches are allowed). Different patches of the same image are used as positive samples, while patches from other images are used as negative samples.
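A minimal sketch of the patch-extraction step (the patch size and stride are illustrative assumptions), using `tf.image.extract_patches`:

```python
import tensorflow as tf

def extract_patches(images, patch_size=8, stride=4):
    # Split a batch of images into overlapping fixed-size patches.
    # Patches from the same image serve as positives; patches from
    # other images serve as negatives.
    patches = tf.image.extract_patches(
        images=images,
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, stride, stride, 1],
        rates=[1, 1, 1, 1],
        padding='VALID')
    batch = tf.shape(images)[0]
    channels = images.shape[-1]  # assumes a statically known channel count (e.g. 3)
    # Reshape to (batch, num_patches, patch_size, patch_size, channels).
    return tf.reshape(patches, [batch, -1, patch_size, patch_size, channels])
```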
SimCLR, a model developed by Google Brain, is a framework for contrastive learning of visual representations. Its basic working principle is to maximize agreement between two differently augmented views of the same image via a contrastive loss in the latent space.
The framework of the SimCLR method is shown below.
A simple framework for contrastive learning of visual representations. Two separate data augmentation operators are sampled from the same family of augmentations ($t \sim \mathcal{T}$ and $t' \sim \mathcal{T}$) and applied to each data example to obtain two correlated views. A base encoder network $f(\cdot)$ and a projection head $g(\cdot)$ are trained to maximize agreement using a contrastive loss. After training is completed, we throw away the projection head $g(\cdot)$ and use encoder $f(\cdot)$ and representation $h$ for downstream tasks.
A nicer illustration is as follows: Image from The Illustrated SimCLR Framework (amitness.com).
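To make the roles of $f(\cdot)$, $g(\cdot)$, and $h$ concrete, here is a minimal Keras sketch of how the pieces fit together (the layer sizes and the small convolutional backbone are illustrative assumptions, not the architecture used in the paper):

```python
import tensorflow as tf

# Base encoder f(.): maps an augmented image to a representation h.
base_encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
], name='f')

# Projection head g(.): maps h to the space z where the contrastive loss is applied.
projection_head = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64),
], name='g')

x = tf.keras.Input(shape=(32, 32, 3))
h = base_encoder(x)        # representation kept for downstream tasks
z = projection_head(h)     # used only during contrastive pre-training
simclr_model = tf.keras.Model(x, z)
```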
Data augmentation module: transforms a given data sample (image) randomly to create two views of the same example ($x_{i}$ and $x_{j}$ in the diagram above). These represent the positive pair. The SimCLR framework applies the following three augmentations: random cropping followed by resizing back to the original size, random color distortion, and random Gaussian blur.
According to the results obtained by the authors, random cropping and color distortion are essential for achieving good performance.
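A sketch of such an augmentation module (the crop size and distortion strengths are illustrative assumptions rather than the exact settings from the paper; the Gaussian blur is omitted for brevity since it is not a built-in `tf.image` op):

```python
import tensorflow as tf

def augment(image):
    # Produce one random view of an image (pixel values assumed to be in [0, 1]).
    # Random crop, then resize back to the original resolution.
    image = tf.image.random_crop(image, size=[28, 28, 3])
    image = tf.image.resize(image, [32, 32])
    # Random color distortion.
    image = tf.image.random_brightness(image, max_delta=0.4)
    image = tf.image.random_saturation(image, lower=0.6, upper=1.4)
    return tf.clip_by_value(image, 0.0, 1.0)

def two_views(image):
    # Apply the augmentation twice to obtain the positive pair (x_i, x_j).
    return augment(image), augment(image)
```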
The loss function used here is called Normalized Temperature-scaled Cross-Entropy, or NT-Xent, loss. It is a modification of the multi-class $N$-pair loss with the addition of a temperature parameter ($T$). As in the multi-class $N$-pair loss, each anchor is compared against one positive and many negatives at once; NT-Xent additionally normalizes the embeddings, scales the similarities by the temperature, and uses all other augmented examples in the batch as the negatives.
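Concretely, for a positive pair $(i, j)$ coming from the two augmented views of the same image in a batch of $N$ images ($2N$ views in total), the NT-Xent loss from the SimCLR paper is

$$
\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_{i}, z_{j}) / T\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_{i}, z_{k}) / T\right)},
$$

where $\mathrm{sim}(u, v) = u^{\top} v / (\lVert u \rVert \lVert v \rVert)$ is the cosine similarity between two projected embeddings and $T$ is the temperature.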
Image from The Illustrated SimCLR Framework (amitness.com).
It's worth noting that NT-Xent has nothing to do with the discrete cosine transform; what it does use is the cosine similarity between projected embeddings, which is scaled by the temperature and then passed through a softmax cross-entropy, as shown in the sketch below.
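This sketch assumes the two projected views $z_{i}$ and $z_{j}$ have already been computed for a batch; the function name and index convention are illustrative rather than taken from any particular library:

```python
import tensorflow as tf

def nt_xent_loss(z_i, z_j, temperature=0.5):
    # z_i, z_j: (N, d) projections of the two augmented views of the same N images.
    batch_size = tf.shape(z_i)[0]
    z = tf.math.l2_normalize(tf.concat([z_i, z_j], axis=0), axis=1)  # (2N, d)
    # Pairwise cosine similarities, scaled by the temperature.
    sim = tf.matmul(z, z, transpose_b=True) / temperature            # (2N, 2N)
    # Mask out self-similarities so a view is never its own negative.
    sim = sim - 1e9 * tf.eye(2 * batch_size)
    # For row i the positive sits at index i + N (and vice versa).
    positives = tf.concat([tf.range(batch_size) + batch_size,
                           tf.range(batch_size)], axis=0)
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        positives, sim, from_logits=True)
    return tf.reduce_mean(loss)
```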
Image from The Illustrated SimCLR Framework (amitness.com).
Image from The Illustrated SimCLR Framework (amitness.com).
Let's implement a simple form of contrastive learning on the CIFAR-10 image dataset using TensorFlow, training a siamese network with a margin-based contrastive loss:
import numpy as np
import tensorflow as tf
# Load the dataset of images
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# Preprocess the data
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
# Define the encoder that maps an image to an embedding vector
encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu')
])
# Siamese model: embed both images of a pair and output the Euclidean
# distance between the two embeddings
input_a = tf.keras.Input(shape=(32, 32, 3))
input_b = tf.keras.Input(shape=(32, 32, 3))
distance = tf.keras.layers.Lambda(
    lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]), axis=1, keepdims=True) + 1e-9)
)([encoder(input_a), encoder(input_b)])
model = tf.keras.Model(inputs=[input_a, input_b], outputs=distance)
# Define the contrastive loss function: pull similar pairs (y_true = 1) together,
# push dissimilar pairs (y_true = 0) at least `margin` apart
def contrastive_loss(y_true, y_pred):
    margin = 1.0
    y_true = tf.cast(y_true, y_pred.dtype)
    return tf.reduce_mean(y_true * tf.square(y_pred) +
                          (1 - y_true) * tf.square(tf.maximum(margin - y_pred, 0)))
# Compile the model with the contrastive loss function
model.compile(optimizer='adam', loss=contrastive_loss)
# Define a function to generate pairs of images for training; the label is
# 1 when the two images share a class and 0 otherwise
def generate_pairs(x, y, batch_size=128):
    while True:
        idx_a = np.random.randint(0, len(x), batch_size)
        idx_b = np.random.randint(0, len(x), batch_size)
        labels = (y[idx_a] == y[idx_b]).astype('float32')
        yield (x[idx_a], x[idx_b]), labels
# Use the `fit` method to train the model on the generated pairs of images
model.fit(generate_pairs(x_train, y_train), steps_per_epoch=200, epochs=5,
          validation_data=generate_pairs(x_test, y_test), validation_steps=50)
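After training, only the shared encoder needs to be kept, mirroring how SimCLR discards the projection head and keeps $f(\cdot)$ for downstream tasks. A minimal sketch, assuming the `encoder` defined above:

```python
# Use the trained encoder to produce embeddings for downstream tasks,
# e.g. training a simple classifier on top of the frozen representations.
embeddings = encoder.predict(x_test)
print(embeddings.shape)  # (10000, 64)
```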