NorbertZheng / read-papers

My paper reading notes.

Edward Ma | A Gentle Introduction to Meta-Learning. #67


NorbertZheng commented 1 year ago

Edward Ma. A Gentle Introduction to Meta-Learning.

NorbertZheng commented 1 year ago

Overview

Human beings can learn new things from just a few examples, while machine learning models typically need a large amount of labeled data to perform well.

Can we build a model that consumes a few examples to pick up a new skill, just like a human being? Meta-Learning (also known as learning to learn) was born to tackle this problem. Given a few examples, a meta-learning model can learn and adapt to a new domain quickly.

In this series of meta-learning stories, we will go through the concept of meta-learning, several meta-learning approaches, and examples. You may visit the following stories to get familiar with meta-learning.

In this story, we will cover the terminology and the idea of meta-learning, and introduce different approaches.

NorbertZheng commented 1 year ago

Terminology

Dataset

In general machine learning terminology, we have a training set, a testing set, and a validation set for training, testing, and validation. In meta-learning, these are renamed as the meta-training set, meta-testing set, and meta-validation set.

Support Set and Query Set

The support set is a set of labeled records (input and label) whose labels are distinct within a task and differ across tasks. The query set is another set of records whose inputs are matched against the support set to predict a label.

N-way K-shot

N-way K-shot refers to a task setup in which the support set contains N distinct classes (ways) with K labeled examples (shots) per class.

We have One-Shot Learning when K is one and Few-Shot Learning when K is only a small number of examples per class. The key ideas are data transformation and knowledge sharing.

For example, suppose we have 100 records (input and label) with ten distinct labels. In every episode, only N × K records (K examples for each of the N sampled labels) are fed into the model. N does not need to match the total number of distinct labels (i.e., 10 in this case). The labels are consistent between the support set and the query set of the same task but vary across tasks.

Unlike general model training, we only pass a subset of the labels during training. The model is thus able to predict unseen labels at prediction time.
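As a concrete illustration of the episode setup above, here is a minimal Python sketch; the `records` format and the `sample_episode` name are hypothetical, not from the article.

```python
import random
from collections import defaultdict

def sample_episode(records, n_way=5, k_shot=1, q_queries=1):
    """Sample one N-way K-shot episode (support set + query set) from labeled records.

    `records` is a list of (input, label) pairs; names here are illustrative.
    """
    by_label = defaultdict(list)
    for x, y in records:
        by_label[y].append(x)

    # Pick N labels for this task; the sampled labels may differ from task to task.
    labels = random.sample(list(by_label.keys()), n_way)

    support, query = [], []
    for y in labels:
        examples = random.sample(by_label[y], k_shot + q_queries)
        support += [(x, y) for x in examples[:k_shot]]   # K shots per label
        query += [(x, y) for x in examples[k_shot:]]     # held-out queries
    return support, query
```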

image Illustration of meta-learning terminology.

NorbertZheng commented 1 year ago

Approaches

Metric-Based

Matching Networks (Matching Nets) were proposed by Vinyals et al. (2016). The idea is simple, but the results are promising. After transforming the support set and the query set into embedding vectors, the authors use cosine similarity (i.e., normalized similarity) to find the most similar data. The Matching Nets framework repeats the same steps in every task: embed the support set and the query, compute cosine similarities, and predict the query label as an attention-weighted combination of the support labels.

The above training process is much like contrastive learning (e.g., the supervised NT-Xent loss introduced in #57), except that there is no loss term that explicitly pushes items of different classes far apart.

The following diagram shows a 4-way 1-shot learning example. Four different dog labels (blue, yellow, orange, and red rectangles) form the support set, while the input is a single dog image. image Matching Nets Architecture (Vinyals et al., 2016).
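To make the classification rule concrete, here is a minimal PyTorch-style sketch of the attention mechanism described above; it omits the full-context embeddings discussed in the paper, and the function name and tensor shapes are my own.

```python
import torch
import torch.nn.functional as F

def matching_net_predict(support_emb, support_labels, query_emb, n_way):
    """Attention over the support set via cosine similarity (sketch of the
    Matching Nets classification rule; embeddings come from any encoder).

    support_emb:    (N*K, D) embeddings of the support set
    support_labels: (N*K,)   integer labels in [0, n_way)
    query_emb:      (Q, D)   embeddings of the query set
    """
    # Cosine similarity between every query and every support example.
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(support_emb, dim=-1).T  # (Q, N*K)
    attn = sims.softmax(dim=-1)                                                 # attention weights
    one_hot = F.one_hot(support_labels, n_way).float()                          # (N*K, n_way)
    return attn @ one_hot                                                       # (Q, n_way) class probabilities
```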

Under this model architecture, Vinyals et al. evaluated it in the computer vision (CV) and natural language processing (NLP) fields to confirm that the model can be applied to problems from different domains. In the CV field, Omniglot and ImageNet are selected as datasets for the experiments.

image Performance evaluation on Omniglot (Vinyals et al., 2016).

image Performance evaluation on a subset of ImageNet (Vinyals et al., 2016).

For the NLP domain, the Penn Treebank dataset is leveraged to conduct an experiment. Given a sentence with a missing word and a set of support sentences, the final model output is the best-matching label from the support set sentences. However, the performance of Matching Nets here is not promising compared to other models.

image Example of input data and support set in NLP (Vinyals et al., 2016).

NorbertZheng commented 1 year ago

Prototypical Networks were proposed by Snell et al. in 2017. The clustering concept is applied to predict the label of data: each class is represented by a prototype, the mean of its support-set embeddings.

In every episode, the model transforms both the support set and the query set through the embedding layers. The centroid of each cluster (i.e., $c_{1}$, $c_{2}$, and $c_{3}$ in the following figure) is the average embedding of the corresponding label, and each query point (i.e., $x$ in the following figure) is classified to the nearest cluster (i.e., $c_{2}$ in the following figure). Instead of using cosine similarity, Prototypical Networks use the squared Euclidean distance.

Just like the max-margin contrastive loss introduced in #57.
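As a sketch of the classification rule (assuming PyTorch; names and shapes are illustrative, not from the paper):

```python
import torch

def prototypical_predict(support_emb, support_labels, query_emb, n_way):
    """Sketch of Prototypical Networks classification: class prototypes are the
    mean support embeddings; queries go to the nearest prototype under squared
    Euclidean distance."""
    # c_k: average embedding of the support examples of class k.
    prototypes = torch.stack([
        support_emb[support_labels == k].mean(dim=0) for k in range(n_way)
    ])                                                  # (n_way, D)
    # Squared Euclidean distance from each query to each prototype.
    dists = torch.cdist(query_emb, prototypes) ** 2     # (Q, n_way)
    return (-dists).softmax(dim=-1)                     # class probabilities
```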

image Few-shot prototypes (left), Zero-shot prototypes (right). (Snell et al., 2017).

Snell et al. use the exact same datasets (Omniglot and miniImageNet) to compare their results with Matching Nets (Vinyals et al., 2016). Experiments show that Prototypical Networks (Snell et al., 2017) achieve better performance than Matching Nets (Vinyals et al., 2016).

image Few-shot classification accuracies on Omniglot. (Snell et al., 2017).

image Few-shot classification accuracies on miniImageNet. (Snell et al., 2017).

NorbertZheng commented 1 year ago

Attentive Recurrent Comparators (ARC) were proposed by Shyam et al. (2017). The model is inspired by how humans compare a set of objects: for example, when playing a photo-hunt game, we look back and forth between the two pictures to spot the differences.

This is because our brain cannot take in and compare the whole image at a single glance.

image Photo Hunt Game (source).

ARC uses a similar approach, comparing a pair of inputs back and forth.

image ARC looks back and forth between the two objects (source).

By leveraging the attention mechanism and a long short-term memory (LSTM) architecture, ARC attends to a portion of Image A and Image B (the pair of inputs) alternately.

A neural network converts $h_{t-1}$ (an empty value in the first step) to $\Omega(t)$, the glimpse parameters mentioned in the paper. A simple way to extract a small portion of an image would be to crop it randomly or sequentially; instead of this simple approach, Shyam et al. propose an attention window built from Cauchy kernels, which is smoother than plain cropping.

image Visualization of ARC comparing two objects (Shyam et al., 2017).

After a certain number of glimpses, the final hidden state is passed to a linear layer, and the output indicates whether the two images are similar or not.
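A very rough sketch of the glimpse loop, assuming PyTorch: the `extract_glimpse` function is a hypothetical stand-in for the Cauchy-kernel attention, and the module structure is my own simplification rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ARCSketch(nn.Module):
    """Sketch of the ARC glimpse loop: an LSTM alternately attends to small
    regions of image A and image B, then a linear layer scores similarity."""

    def __init__(self, glimpse_dim, hidden_dim, n_glimpses=8):
        super().__init__()
        self.controller = nn.LSTMCell(glimpse_dim, hidden_dim)
        self.glimpse_params = nn.Linear(hidden_dim, 3)   # Omega(t): e.g. (x, y, zoom) from h_{t-1}
        self.score = nn.Linear(hidden_dim, 1)            # similar / not similar logit
        self.n_glimpses = n_glimpses

    def forward(self, img_a, img_b, extract_glimpse):
        batch = img_a.size(0)
        h = img_a.new_zeros(batch, self.controller.hidden_size)
        c = torch.zeros_like(h)
        for t in range(self.n_glimpses):
            omega = self.glimpse_params(h)               # glimpse parameters from h_{t-1}
            img = img_a if t % 2 == 0 else img_b         # alternate between the pair
            g = extract_glimpse(img, omega)              # hypothetical: (batch, glimpse_dim)
            h, c = self.controller(g, (h, c))
        return self.score(h)                             # similarity score after all glimpses
```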

image Performance evaluation on Omniglot (Shyam et al., 2017).

NorbertZheng commented 1 year ago

Optimization-Based

The LSTM-based meta-learner was proposed by Ravi and Larochelle in 2016. The authors observe that gradient-based optimization fails in the face of a few labeled examples due to two limitations: a gradient-based algorithm is not designed to converge within a small number of steps, and the weights need to be re-initialized for every new task. Ravi and Larochelle therefore propose training an LSTM-based meta-learner that learns the update rule itself, so the learner can adapt quickly from a few examples.
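The key observation of the paper is that the gradient-descent update has the same form as the LSTM cell-state update, so the cell state can carry the learner's parameters and the gates can produce a learned learning rate. A sketch of the correspondence:

$$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t \quad \Longleftrightarrow \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$

with $c_t = \theta_t$, $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$, the input gate $i_t$ playing the role of the learning rate $\alpha_t$, and the forget gate $f_t$ acting as a learned shrinkage factor (equal to $1$ in plain gradient descent).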

First of all, we split the dataset into two subsets, Meta-Train and Meta-Test. After that, $D_{train}$ and $D_{test}$ are split into mini-batches as follows. Each record set (one row) is fed into the model per step during training.

image Meta-learning setup (Ravi and Larochelle, 2016).

Training a meta-learner is quite straightforward. An explanation will be provided step by step:

image LSTM-based meta-learner procedure (Ravi and Larochelle, 2016).

image Few-shot classification accuracies on miniImageNet. (Ravi and Larochelle, 2016).

NorbertZheng commented 1 year ago

Model-Agnostic Meta-Learning (MAML) was proposed by Finn et al. in 2017. It is a model-agnostic framework, meaning it is not tied to any specific model architecture. Finn et al. evaluate this framework on regression, classification, and reinforcement learning problems, and the results are promising.

The objective of MAML is to learn a model initialization from which rapid progress can be made on new tasks. Two levels of gradient updates are involved in this framework.

They call it the update that involves a gradient through a gradient.
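Concretely (sketching the supervised case from the paper), the inner update adapts the parameters to each task $\mathcal{T}_i$, and the outer update differentiates through that adaptation:

$$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta}), \qquad \theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}).$$

Because the outer loss is evaluated at the adapted parameters $\theta_i'$ but differentiated with respect to $\theta$, the meta-gradient involves second-order terms, which is exactly the gradient through a gradient.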

The following figure shows the concept of the gradient-through-gradient update. The dotted lines are the per-task gradients, while the solid line is the final gradient after considering all tasks' gradients.

image Optimizes for a representation $\theta$ that can quickly adapt to new tasks (Finn et al., 2017).

First of all, the MAML model is pre-trained using the following procedure. An explanation will be provided step by step:

image Meta-learning procedure (Finn et al., 2017).

The fine-tuning stage is quite similar to the meta-learning phase, except that there is no weight initialization (we already have pre-trained weights).

image Fine-tuning procedure (Finn et al., 2017).

NorbertZheng commented 1 year ago

Memory-Based

Memory-Augmented Neural Networks (MANN) for One-Shot Learning were proposed by Santoro et al. (2016). They apply MANN to meta-learning so that the model can access external memory (or information) to assist in predicting the result, which helps with memorizing rare events. Before that, we will have a quick look at the underlying concept and then come back to MANN in meta-learning.

Neural Turing Machines (NTM) were introduced by Graves et al. in 2014. A quick summary is that an NTM couples a neural network controller with an external memory matrix that the controller can read from and write to through attention-based heads.

To access external memory, NTM provides two addressing mechanisms: content-based addressing and location-based addressing.
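As a sketch of content-based addressing (assuming PyTorch; the function name and shapes are illustrative): the read/write weighting is a softmax over cosine similarity between a key emitted by the controller and each memory row.

```python
import torch
import torch.nn.functional as F

def content_addressing(memory, key, beta):
    """Sketch of NTM content-based addressing.

    memory: (N, M) memory matrix, key: (M,) key vector, beta: scalar strength > 0
    """
    sims = F.cosine_similarity(memory, key.unsqueeze(0), dim=-1)  # (N,) similarity per row
    weights = F.softmax(beta * sims, dim=-1)                      # attention over memory rows
    read_vector = weights @ memory                                # weighted read from memory
    return weights, read_vector
```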

image NTM Architecture (Graves et al., 2014).

Santoro et al. follow the setup of NTM with some modifications to the inputs and the addressing method.

image MANN Task Setup (Santoro et al., 2016).

NorbertZheng commented 1 year ago

Meta Networks (MetaNet) were introduced by Munkhdalai and Yu (2017). The MetaNet model addresses the lack of ability to learn new task information on the fly. The proposed solution is to combine slow weights, which are learned across tasks by standard gradient descent, with fast weights, which are generated on the fly from task-specific gradient information.

The task-specific embeddings are learned by the meta learner, which is a task-agnostic model. After the embeddings are generated, they are passed to the base learner, a task-specific model, to generate the output.
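To make the fast/slow-weight idea concrete, here is a rough sketch of a layer-augmentation style combination, assuming PyTorch; it is my simplification, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AugmentedLayer(nn.Module):
    """Rough sketch of combining slow weights (learned across tasks by normal
    gradient descent) with fast weights generated per task by a meta learner."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.slow = nn.Linear(in_dim, out_dim)   # slow weights: shared across tasks

    def forward(self, x, fast_weight, fast_bias):
        # fast_weight (out_dim, in_dim) and fast_bias (out_dim,) are produced
        # per task by the meta learner (not shown here).
        slow_out = torch.relu(self.slow(x))
        fast_out = torch.relu(x @ fast_weight.T + fast_bias)
        return slow_out + fast_out               # aggregate the slow and fast paths
```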

image Meta Networks Architecture (Munkhdalai and Yu, 2017).

To train a MetaNet, there are three main procedures: acquiring meta information, generating fast weights, and optimizing slow weights. Here is the pseudocode:

image MetaNet training procedure (Munkhdalai and Yu, 2017).

NorbertZheng commented 1 year ago

Appendix

image A comparison of published meta-learning approaches (Metz et al., 2018).

NorbertZheng commented 1 year ago

Take Away

All flowers are in bloom, and hundreds of schools of thought contend. There are lots of different approaches to tackling the meta-learning problem. From my experience, meta-learning is mainly developed in computer vision (CV), while some researchers also apply it in natural language processing (NLP). One possible reason is that transfer learning (e.g., BERT, XLNet) is very successful in NLP, and the vocabulary there is too varied for meta-learning to be easily adopted.

NorbertZheng commented 1 year ago

Further Reading

NorbertZheng commented 1 year ago

References