Human beings can learn new things from just a few examples, while deep learning thus far is data-hungry, i.e., it has poor sample efficiency.
The classic way to obtain a well-performing model is to train it on millions, or even billions, of examples; data augmentation is one method for generating synthetic samples when data is scarce.
Can we instead build a model that consumes only a few examples to pick up a new skill, just like a human being? Meta-Learning (also known as learning to learn) was born to tackle this problem. Given a few examples, a meta-learning model can learn and adapt to a new domain quickly.
In this series of meta-learning stories, we will go through the concept of meta-learning, several meta-learning approaches, and examples. You may visit the following to get familiar with meta-learning.
In this story, we will cover the terminology and core idea of meta-learning, and introduce several different approaches:
In general machine learning terminology, we have a training set, a testing set, and a validation set for training, testing, and validation. In meta-learning, these are renamed the meta-training set, the meta-testing set, and the meta-validation set.
Within the meta-training set, we have a number of training and testing sets that together form tasks. The support set is a set of records (input and label) whose labels are distinct within each task; the query set is another set of labeled records, and the model matches each query input against the support set to select a label.
N-way K-shot refers to a task whose support set contains N distinct labels (ways) with K labeled examples (shots) per label.
We speak of One-Shot Learning when there is just one example per label, and Few-Shot Learning when there are only a few (far fewer than the total number of available examples). The key ideas are data transformation and knowledge sharing.
For example, suppose we have 100 records (input and label) with ten distinct labels. In every batch, only N × K records (K examples for each of N labels) are fed into the model. N does not need to match the total number of distinct labels (i.e., 10 in this case). The labels are consistent between the support set and the query set of the same task but vary across tasks.
Different from general model training, we pass only a subset of the labels during training; the model gains the capability to predict unseen labels at prediction time.
Illustration of meta-learning terminology.
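To make the terminology concrete, here is a minimal sketch (plain Python; all names such as `sample_episode` are hypothetical, not from any library) of how an N-way K-shot episode with a support set and a query set could be sampled from a labeled dataset:

```python
import random
from collections import defaultdict

def sample_episode(records, n_way, k_shot, q_queries, rng=random):
    """Sample one N-way K-shot episode: a support set and a query set
    drawn from the same N randomly chosen labels."""
    by_label = defaultdict(list)
    for x, y in records:
        by_label[y].append(x)

    # Pick N distinct labels for this task.
    labels = rng.sample(sorted(by_label), n_way)

    support, query = [], []
    for y in labels:
        xs = rng.sample(by_label[y], k_shot + q_queries)
        support += [(x, y) for x in xs[:k_shot]]   # K shots per label
        query += [(x, y) for x in xs[k_shot:]]     # same labels, new inputs
    return support, query

# Toy data: 100 records over 10 distinct labels, as in the example above.
data = [(i, i % 10) for i in range(100)]
support, query = sample_episode(data, n_way=4, k_shot=1, q_queries=2)
# 4-way 1-shot: 4 support records, 8 query records, same 4 labels in both.
```

Note that each call draws a fresh subset of labels, so labels are consistent within a task but vary across tasks.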
Matching Networks (Matching Nets) is proposed by Vinyals et al. (2016). The idea is simple, but the result is promising. After transforming the support set and query set into embedding vectors, the authors use cosine distance (i.e., normalized similarity) to find the most similar data. The Matching Nets framework repeats the following steps in every task.
The above training process is much like that of contrastive learning, except that there is no loss term explicitly pushing items of different classes far apart, like the supervised NT-Xent loss introduced in #57.
The following diagram shows a 4-way 1-shot learning example: four different dog labels (blue, yellow, orange, and red rectangles) form the support set, while the input is a single dog image. Matching Nets Architecture (Vinyals et al., 2016).
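As a sketch of the matching step (simplified: no fully conditional embeddings, just softmax attention over cosine similarities; the helper names are my own), classification can be written as attention over the support set:

```python
import numpy as np

def cosine_matching(support_emb, support_labels, query_emb):
    """Classify each query embedding by cosine-similarity attention
    over the support set, in the spirit of Matching Nets."""
    s = support_emb / np.linalg.norm(support_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = q @ s.T                                        # (n_query, n_support)
    attn = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    # Sum attention mass per label and pick the argmax.
    labels = np.unique(support_labels)
    scores = np.stack(
        [attn[:, support_labels == y].sum(axis=1) for y in labels], axis=1)
    return labels[np.argmax(scores, axis=1)]

# 4-way 1-shot toy example with hand-made 2-D embeddings.
support = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
sup_labels = np.array([0, 1, 2, 3])
queries = np.array([[0.9, 0.1], [-0.2, -0.8]])
print(cosine_matching(support, sup_labels, queries))  # → [0 3]
```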
Under this model architecture, Vinyals et al. evaluated it in the computer vision (CV) and natural language processing (NLP) fields to confirm the model can be applied across domains. In the CV field, Omniglot and ImageNet are selected as datasets for the experiments.
Performance evaluation on Omniglot (Vinyals et al., 2016).
Performance evaluation on a subset of ImageNet (Vinyals et al., 2016).
For the NLP domain, the Penn Treebank dataset is used to conduct an experiment. Given a sentence with a missing word and a set of support sentences, the final model output is the best-matching label from the support-set sentences. However, the performance of Matching Nets here is not promising compared with other models.
Example of input data and support set in NLP (Vinyals et al., 2016).
Prototypical Networks is proposed by Snell et al. in 2017. A clustering concept is applied to predict the label of the data.
In every episode, the model transforms both the support set and the query set with the embedding layers. A prototype (i.e., $c_{1}$, $c_{2}$, and $c_{3}$ in the following figure) is the average embedding of the corresponding label's support examples, while each query point (i.e., $x$ in the following figure) is classified to the nearest prototype (i.e., $c_{2}$ in the following figure). Instead of cosine distance, Prototypical Networks use squared Euclidean distance.
Just like the max-margin contrastive loss introduced in #57.
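The prototype computation and nearest-centroid classification above can be sketched in a few lines (a minimal NumPy illustration with hand-made 2-D embeddings; function names are my own):

```python
import numpy as np

def prototypes(support_emb, support_labels):
    """Prototype c_k = mean embedding of the support examples with label k."""
    labels = np.unique(support_labels)
    cents = np.stack(
        [support_emb[support_labels == y].mean(axis=0) for y in labels])
    return labels, cents

def classify(query_emb, labels, cents):
    """Assign each query to the nearest prototype by squared Euclidean distance."""
    d2 = ((query_emb[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
    return labels[np.argmin(d2, axis=1)]

# 3-way 2-shot toy task: two support embeddings per label.
sup = np.array([[0., 0.], [0., 2.], [4., 0.], [4., 2.], [8., 0.], [8., 2.]])
sup_y = np.array([0, 0, 1, 1, 2, 2])
labels, cents = prototypes(sup, sup_y)    # centroids at x = 0, 4, 8 (y = 1)
print(classify(np.array([[3.5, 1.0], [7.0, 1.0]]), labels, cents))  # → [1 2]
```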
Few-shot prototypes (left), Zero-shot prototypes (right). (Snell et al., 2017).
Snell et al. use exactly the same datasets (Omniglot and miniImageNet) to compare against Matching Nets (Vinyals et al., 2016). Experiments show that Prototypical Networks (Snell et al., 2017) achieve better performance than Matching Nets (Vinyals et al., 2016).
Few-shot classification accuracies on Omniglot. (Snell et al., 2017).
Few-shot classification accuracies on miniImageNet. (Snell et al., 2017).
Attentive Recurrent Comparators (ARC) is proposed by Shyam et al. (2017). It is inspired by how humans compare a set of objects: for example, in a Photo Hunt game we look back and forth between two pictures to spot the differences, because our brain cannot take in a whole image in a single glance.
Photo Hunt Game (source).
ARC uses a similar approach, comparing a pair of inputs back and forth.
ARC looks back and forth between the two objects (source).
By leveraging the attention mechanism and a long short-term memory (LSTM) architecture, ARC attends to portions of Image A and Image B (the pair of inputs) alternately.
A neural network converts $h_{t-1}$ (empty in the first step) to $\Omega_t$, the glimpse parameters mentioned in the paper. A simple way to extract a small portion of an image for learning would be to crop it randomly or sequentially; instead, Shyam et al. propose using Cauchy kernels for the attention window, because they decay more smoothly than a hard crop.
Visualization of ARC comparing two objects (Shyam et al., 2017).
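To illustrate why a Cauchy kernel is smoother than a hard crop, here is a minimal sketch of a soft glimpse built from 1-D Cauchy filterbanks (a simplified parameterization with fixed stride; the paper's exact glimpse parameterization differs, and all names here are my own):

```python
import numpy as np

def cauchy_filterbank(length, n, center, gamma):
    """n 1-D Cauchy kernels over `length` pixels, evenly spaced around
    `center`. Unlike a hard crop, every pixel gets a nonzero weight."""
    mus = center + (np.arange(n) - n / 2 + 0.5)   # kernel centers, stride 1
    a = np.arange(length)
    F = 1.0 / (1.0 + ((a[None, :] - mus[:, None]) / gamma) ** 2)
    return F / F.sum(axis=1, keepdims=True)        # each row sums to 1

def glimpse(image, cy, cx, n=4, gamma=1.0):
    """Extract an n x n soft glimpse: Fy @ image @ Fx^T."""
    Fy = cauchy_filterbank(image.shape[0], n, cy, gamma)
    Fx = cauchy_filterbank(image.shape[1], n, cx, gamma)
    return Fy @ image @ Fx.T

img = np.arange(64.0).reshape(8, 8)
g = glimpse(img, cy=4.0, cx=4.0)   # 4x4 soft crop around the image center
```

In ARC, the glimpse parameters (center and width) would come from the controller's previous hidden state rather than being fixed as here.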
After a certain number of glimpses, the final hidden state is passed to a linear layer, whose output indicates whether the two images are similar or not.
Performance evaluation on Omniglot (Shyam et al., 2017).
An LSTM-based meta-learner is proposed by Ravi and Larochelle in 2016. The authors observe that gradient-based optimization fails in the face of a few labeled examples due to two limitations: gradient-based algorithms were not designed to converge within a small number of steps, and the weights need to be re-initialized for every new task. Ravi and Larochelle therefore propose learning the optimizer itself with an LSTM.
First of all, we split the dataset into two subsets, Meta-Train and Meta-Test. After that, $D_{train}$ and $D_{test}$ are split into mini-batches as follows; every record set (one row) is fed into the model for one training step.
Meta-learning setup (Ravi and Larochelle, 2016).
Training a meta-learner is quite straightforward. An explanation will be provided step by step:
LSTM-based meta-learner procedure (Ravi and Larochelle, 2016).
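The core trick is that the learner's parameters play the role of the LSTM cell state, so one update resembles a gated gradient step. Here is a minimal single-step sketch (the gate parameterization and weight shapes are simplified stand-ins, not the paper's exact architecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_lstm_step(theta, grad, loss, W_i, W_f):
    """One update in the style of the LSTM-based meta-learner:
        theta_t = f_t * theta_{t-1} + i_t * (-grad),
    where the gates i_t (a learned per-parameter learning rate) and
    f_t (a learned forget/decay gate) are functions of (grad, loss, theta).
    W_i and W_f stand in for the meta-learner's learned weights."""
    feats = np.stack([grad, np.full_like(theta, loss), theta], axis=1)  # (d, 3)
    i_t = sigmoid(feats @ W_i)   # per-parameter "learning rate" gate
    f_t = sigmoid(feats @ W_f)   # per-parameter "forget" gate
    return f_t * theta + i_t * (-grad)

# One step on a toy quadratic loss L(theta) = ||theta - target||^2.
theta = np.zeros(3)
target = np.array([1.0, -2.0, 0.5])
grad = 2 * (theta - target)
loss = float(((theta - target) ** 2).sum())
rng = np.random.default_rng(0)
W_i, W_f = rng.normal(size=3), rng.normal(size=3)  # untrained gate weights
theta_next = meta_lstm_step(theta, grad, loss, W_i, W_f)
```

In the paper these gate weights are themselves trained across tasks, so that the update rule converges quickly from a learned initialization.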
Few-shot classification accuracies on miniImageNet. (Ravi and Larochelle, 2016).
Model-Agnostic Meta-Learning (MAML) is proposed by Finn et al. in 2017. As the name suggests, the framework is model-agnostic: it is not tied to any specific model architecture. Finn et al. evaluate it on regression, classification, and reinforcement learning problems, and the results are promising.
The objective of MAML is to learn a model that can make rapid progress on new tasks once pre-trained. Two gradient updates are involved in this framework.
They call this update a gradient through a gradient.
The following figure shows the concept of updating gradient by gradient: the dotted lines are the per-task gradients, while the solid line is the final gradient after considering all tasks' gradients.
Optimizes for a representation $\theta$ that can quickly adapt to new tasks (Finn et al., 2017).
First of all, the MAML model is pre-trained using the following procedure. An explanation is provided step by step:
Meta-learning procedure (Finn et al., 2017).
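To make "gradient through a gradient" concrete, here is a minimal sketch using per-task quadratic losses $L_i(\theta) = \lVert\theta - t_i\rVert^2$, chosen purely so that the inner gradient step and the derivative through it have closed forms and no autograd is needed (the toy tasks and step sizes are my own assumptions):

```python
import numpy as np

def maml_quadratic(targets, alpha=0.1, beta=0.05, steps=200):
    """MAML on toy tasks with per-task loss L_i(theta) = ||theta - t_i||^2.

    Inner (dotted-line) step per task:
        theta_i' = theta - alpha * 2(theta - t_i)
    Outer (solid-line) gradient, differentiated THROUGH the inner step:
        d L_i(theta_i') / d theta = (1 - 2*alpha) * 2 (theta_i' - t_i)
    """
    theta = np.zeros_like(targets[0])
    for _ in range(steps):
        meta_grad = np.zeros_like(theta)
        for t in targets:
            theta_inner = theta - alpha * 2 * (theta - t)         # inner update
            meta_grad += (1 - 2 * alpha) * 2 * (theta_inner - t)  # grad through grad
        theta -= beta * meta_grad                                 # outer update
    return theta

tasks = [np.array([1.0, 0.0]), np.array([3.0, 0.0]), np.array([2.0, 3.0])]
theta = maml_quadratic(tasks)
# For these symmetric quadratic tasks the meta-initialization converges
# to the mean of the task optima:
print(np.round(theta, 3))  # → [2. 1.]
```

The factor `(1 - 2 * alpha)` is exactly the Jacobian of the inner update; dropping it would give first-order MAML.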
The fine-tuning stage is quite similar to the meta-learning phase, except that there is no weight initialization (as we already have pre-trained weights).
Fine-tuning procedure (Finn et al., 2017).
Memory-Augmented Neural Networks (MANN) for one-shot learning is proposed by Santoro et al. (2016). They apply MANN to meta-learning so that the model can access external memory (or information) to assist it in predicting the result; this helps with memorizing rare events. Before that, we will take a quick look at the underlying concept and then come back to MANN in meta-learning.
Neural Turing Machines (NTM) is introduced by Graves et al. in 2014. In short, an NTM couples a neural network controller with an external memory matrix, which the controller reads from and writes to through differentiable attention.
To access the external memory, NTM provides two addressing mechanisms: content-based addressing and location-based addressing.
NTM Architecture (Graves et al., 2014).
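Content-based addressing can be sketched in a few lines: the controller emits a key, and the read weights are a softmax over (temperature-scaled) cosine similarities between the key and every memory row (a minimal NumPy illustration; the memory contents and key here are made up):

```python
import numpy as np

def content_addressing(memory, key, beta=1.0):
    """Content-based addressing: softmax over scaled cosine similarity
    between a query key and every memory row, as in NTM reads."""
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    k = key / np.linalg.norm(key)
    sims = m @ k                       # cosine similarity per memory slot
    w = np.exp(beta * sims)            # beta sharpens the focus
    return w / w.sum()                 # attention weights over memory rows

M = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
w = content_addressing(M, key=np.array([1.0, 0.1]), beta=5.0)
read = w @ M                           # soft read: weighted sum of rows
```

Because the weights are a differentiable function of the key, the whole read operation can be trained end to end by backpropagation.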
Santoro et al. follow the setup of NTM with some modifications to the inputs and the addressing methods.
MANN Task Setup (Santoro et al., 2016).
Meta Networks (MetaNet) is introduced by Munkhdalai and Yu (2017). The MetaNet model addresses the lack of ability to learn new task information on the fly. The proposed solution is to generate task-specific fast weights on the fly, on top of task-agnostic slow weights learned across tasks.
The task-specific embeddings are learned by the meta learner, which is a task-agnostic model. The generated embeddings are then passed to the base learner, a task-specific model, to generate the output.
Meta Networks Architecture (Munkhdalai and Yu, 2017).
To train a MetaNet, there are three main procedures: acquiring meta information, generating fast weights, and optimizing slow weights. Here is the pseudocode:
MetaNet training procedure (Munkhdalai and Yu, 2017).
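The fast/slow-weight idea can be sketched very simply: the slow weights are trained across tasks, while the fast weights are generated per task by the meta learner and combined with them before the nonlinearity. The additive combination below is one simple integration scheme for illustration (MetaNet itself uses a learned layer-augmentation to merge the two), and all values are made up:

```python
import numpy as np

def metanet_forward(x, W_slow, W_fast):
    """A layer augmented with task-specific fast weights: slow weights
    are learned across tasks, fast weights are generated on the fly for
    the current task and merged in (here: simply added)."""
    return np.tanh((W_slow + W_fast) @ x)

rng = np.random.default_rng(1)
W_slow = rng.normal(size=(4, 3)) * 0.1   # task-agnostic, updated slowly
W_fast = rng.normal(size=(4, 3)) * 0.1   # produced per task by the meta learner
y = metanet_forward(np.ones(3), W_slow, W_fast)
```

Setting `W_fast` to zeros recovers an ordinary layer, which makes clear that the fast weights act as a per-task correction.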
A comparison of published meta-learning approaches (Metz et al., 2018).
A hundred flowers bloom, and a hundred schools of thought contend: there are lots of different approaches to tackling the meta-learning problem. From my experience, meta-learning is mainly developed in computer vision (CV), while some researchers also apply it in the natural language processing (NLP) field. One possible reason is that transfer learning (e.g., BERT, XLNet) is already very successful in NLP, and the vocabulary variety is too large to adopt meta-learning easily.
Edward Ma. A Gentle Introduction to Meta-Learning.