Human beings can learn new things from just a few examples, while deep learning thus far is data-hungry, i.e., it has poor sample efficiency.
The classic way to obtain a well-performing model is to train it on millions, or even billions, of examples; data augmentation is one method for generating synthetic samples when data is scarce.
Can we instead build a model that consumes only a few examples to pick up a new skill, just like a human being? Meta-Learning (also known as learning to learn) was born to tackle this problem. Given a few examples, a meta-learning model can learn and adapt to a new domain quickly.
In this series of meta-learning stories, we will go through the concept of meta-learning, several meta-learning approaches, and examples. You may visit the following to get familiar with meta-learning.
In this story, we will cover the terminology and core idea of meta-learning, and introduce several different approaches:
In general machine learning terminology, we have a training set, a testing set, and a validation set for training, testing, and validation. In meta-learning, these are renamed the meta-training set, the meta-testing set, and the meta-validation set.
Within the meta-training set, we have a number of training and testing sets that together form tasks. The support set is a set of records (input and label) whose labels are distinct within each task; the query set is another set of labeled records, and the model matches each query input against the support set to select a label.
N-way K-shot refers to a task whose support set contains N distinct labels (ways) with K labeled examples (shots) per label.
We speak of One-Shot Learning when there is just one example per label, and Few-Shot Learning when there are only a few (far fewer than the total number of available examples). The key ideas are data transformation and knowledge sharing.
For example, suppose we have 100 records (input and label) with ten distinct labels. In every batch, only N × K records (K examples for each of N labels) are fed into the model. N does not need to match the total number of distinct labels (i.e., 10 in this case). The labels are consistent between the support set and the query set of the same task but vary across tasks.
Different from general model training, we pass only a subset of the labels during training; the model gains the capability to predict unseen labels at prediction time.
Illustration of meta-learning terminology.
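To make the terminology concrete, here is a minimal sketch (plain Python; all names such as `sample_episode` are hypothetical, not from any library) of how an N-way K-shot episode with a support set and a query set could be sampled from a labeled dataset:

```python
import random
from collections import defaultdict

def sample_episode(records, n_way, k_shot, q_queries, rng=random):
    """Sample one N-way K-shot episode: a support set and a query set
    drawn from the same N randomly chosen labels."""
    by_label = defaultdict(list)
    for x, y in records:
        by_label[y].append(x)

    # Pick N distinct labels for this task.
    labels = rng.sample(sorted(by_label), n_way)

    support, query = [], []
    for y in labels:
        xs = rng.sample(by_label[y], k_shot + q_queries)
        support += [(x, y) for x in xs[:k_shot]]   # K shots per label
        query += [(x, y) for x in xs[k_shot:]]     # same labels, new inputs
    return support, query

# Toy data: 100 records over 10 distinct labels, as in the example above.
data = [(i, i % 10) for i in range(100)]
support, query = sample_episode(data, n_way=4, k_shot=1, q_queries=2)
# 4-way 1-shot: 4 support records, 8 query records, same 4 labels in both.
```

Note that each call draws a fresh subset of labels, so labels are consistent within a task but vary across tasks.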
Matching Networks (Matching Nets) is proposed by Vinyals et al. (2016). The idea is simple, but the result is promising. After transforming the support set and query set into embedding vectors, the authors use cosine distance (i.e., normalized similarity) to find the most similar data. The Matching Nets framework repeats the following steps in every task.
The above training process is much like that of contrastive learning, except that there is no loss term explicitly pushing items of different classes far apart, like the supervised NT-Xent loss introduced in #57.
The following diagram shows a 4-way 1-shot learning example: four different dog labels (blue, yellow, orange, and red rectangles) form the support set, while the input is a single dog image. Matching Nets Architecture (Vinyals et al., 2016).
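As a sketch of the matching step (simplified: no fully conditional embeddings, just softmax attention over cosine similarities; the helper names are my own), classification can be written as attention over the support set:

```python
import numpy as np

def cosine_matching(support_emb, support_labels, query_emb):
    """Classify each query embedding by cosine-similarity attention
    over the support set, in the spirit of Matching Nets."""
    s = support_emb / np.linalg.norm(support_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = q @ s.T                                        # (n_query, n_support)
    attn = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    # Sum attention mass per label and pick the argmax.
    labels = np.unique(support_labels)
    scores = np.stack(
        [attn[:, support_labels == y].sum(axis=1) for y in labels], axis=1)
    return labels[np.argmax(scores, axis=1)]

# 4-way 1-shot toy example with hand-made 2-D embeddings.
support = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
sup_labels = np.array([0, 1, 2, 3])
queries = np.array([[0.9, 0.1], [-0.2, -0.8]])
print(cosine_matching(support, sup_labels, queries))  # → [0 3]
```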
Under this model architecture, Vinyals et al. evaluated it in the computer vision (CV) and natural language processing (NLP) fields to confirm the model can be applied across domains. In the CV field, Omniglot and ImageNet are selected as datasets for the experiments.
Performance evaluation on Omniglot (Vinyals et al., 2016).
Performance evaluation on a subset of ImageNet (Vinyals et al., 2016).
For the NLP domain, the Penn Treebank dataset is used to conduct an experiment. Given a sentence with a missing word and a set of support sentences, the final model output is the best-matching label from the support-set sentences. However, the performance of Matching Nets here is not promising compared with other models.
Example of input data and support set in NLP (Vinyals et al., 2016).
Prototypical Networks is proposed by Snell et al. in 2017. A clustering concept is applied to predict the label of the data.
In every episode, the model transforms both the support set and the query set with the embedding layers. A prototype (i.e., $c_{1}$, $c_{2}$, and $c_{3}$ in the following figure) is the average embedding of the corresponding label's support examples, while each query point (i.e., $x$ in the following figure) is classified to the nearest prototype (i.e., $c_{2}$ in the following figure). Instead of cosine distance, Prototypical Networks use squared Euclidean distance.
Just like the max-margin contrastive loss introduced in #57.
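The prototype computation and nearest-centroid classification above can be sketched in a few lines (a minimal NumPy illustration with hand-made 2-D embeddings; function names are my own):

```python
import numpy as np

def prototypes(support_emb, support_labels):
    """Prototype c_k = mean embedding of the support examples with label k."""
    labels = np.unique(support_labels)
    cents = np.stack(
        [support_emb[support_labels == y].mean(axis=0) for y in labels])
    return labels, cents

def classify(query_emb, labels, cents):
    """Assign each query to the nearest prototype by squared Euclidean distance."""
    d2 = ((query_emb[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
    return labels[np.argmin(d2, axis=1)]

# 3-way 2-shot toy task: two support embeddings per label.
sup = np.array([[0., 0.], [0., 2.], [4., 0.], [4., 2.], [8., 0.], [8., 2.]])
sup_y = np.array([0, 0, 1, 1, 2, 2])
labels, cents = prototypes(sup, sup_y)    # centroids at x = 0, 4, 8 (y = 1)
print(classify(np.array([[3.5, 1.0], [7.0, 1.0]]), labels, cents))  # → [1 2]
```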
Few-shot prototypes (left), Zero-shot prototypes (right). (Snell et al., 2017).
Snell et al. use exactly the same datasets (Omniglot and miniImageNet) to compare against Matching Nets (Vinyals et al., 2016). Experiments show that Prototypical Networks (Snell et al., 2017) achieve better performance than Matching Nets (Vinyals et al., 2016).
Few-shot classification accuracies on Omniglot. (Snell et al., 2017).
Few-shot classification accuracies on miniImageNet. (Snell et al., 2017).
Attentive Recurrent Comparators (ARC) is proposed by Shyam et al. (2017). It is inspired by how humans compare a set of objects: for example, in a Photo Hunt game we look back and forth between two pictures to spot the differences, because our brain cannot take in a whole image in a single glance.
Photo Hunt Game (source).
ARC uses a similar approach, comparing a pair of inputs back and forth.
ARC looks back and forth between the two objects (source).
By leveraging the attention mechanism and a long short-term memory (LSTM) architecture, ARC attends to portions of Image A and Image B (the pair of inputs) alternately.
A neural network converts $h_{t-1}$ (empty in the first step) to $\Omega_t$, the glimpse parameters mentioned in the paper. A simple way to extract a small portion of an image for learning would be to crop it randomly or sequentially; instead, Shyam et al. propose using Cauchy kernels for the attention window, because they decay more smoothly than a hard crop.
Visualization of ARC comparing two objects (Shyam et al., 2017).
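To illustrate why a Cauchy kernel is smoother than a hard crop, here is a minimal sketch of a soft glimpse built from 1-D Cauchy filterbanks (a simplified parameterization with fixed stride; the paper's exact glimpse parameterization differs, and all names here are my own):

```python
import numpy as np

def cauchy_filterbank(length, n, center, gamma):
    """n 1-D Cauchy kernels over `length` pixels, evenly spaced around
    `center`. Unlike a hard crop, every pixel gets a nonzero weight."""
    mus = center + (np.arange(n) - n / 2 + 0.5)   # kernel centers, stride 1
    a = np.arange(length)
    F = 1.0 / (1.0 + ((a[None, :] - mus[:, None]) / gamma) ** 2)
    return F / F.sum(axis=1, keepdims=True)        # each row sums to 1

def glimpse(image, cy, cx, n=4, gamma=1.0):
    """Extract an n x n soft glimpse: Fy @ image @ Fx^T."""
    Fy = cauchy_filterbank(image.shape[0], n, cy, gamma)
    Fx = cauchy_filterbank(image.shape[1], n, cx, gamma)
    return Fy @ image @ Fx.T

img = np.arange(64.0).reshape(8, 8)
g = glimpse(img, cy=4.0, cx=4.0)   # 4x4 soft crop around the image center
```

In ARC, the glimpse parameters (center and width) would come from the controller's previous hidden state rather than being fixed as here.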
After a certain number of glimpses, the final hidden state is passed to a linear layer, whose output indicates whether the two images are similar or not.
Performance evaluation on Omniglot (Shyam et al., 2017).
An LSTM-based meta-learner is proposed by Ravi and Larochelle in 2016. The authors observe that gradient-based optimization fails in the face of a few labeled examples due to two limitations: gradient-based algorithms were not designed to converge within a small number of steps, and the weights need to be re-initialized for every new task. Ravi and Larochelle therefore propose learning the optimizer itself with an LSTM.
First of all, we split the dataset into two subsets, Meta-Train and Meta-Test. After that, $D_{train}$ and $D_{test}$ are split into mini-batches as follows; every record set (one row) is fed into the model for one training step.
Meta-learning setup (Ravi and Larochelle, 2016).
Training a meta-learner is quite straightforward. An explanation will be provided step by step:
LSTM-based meta-learner procedure (Ravi and Larochelle, 2016).
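The core trick is that the learner's parameters play the role of the LSTM cell state, so one update resembles a gated gradient step. Here is a minimal single-step sketch (the gate parameterization and weight shapes are simplified stand-ins, not the paper's exact architecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_lstm_step(theta, grad, loss, W_i, W_f):
    """One update in the style of the LSTM-based meta-learner:
        theta_t = f_t * theta_{t-1} + i_t * (-grad),
    where the gates i_t (a learned per-parameter learning rate) and
    f_t (a learned forget/decay gate) are functions of (grad, loss, theta).
    W_i and W_f stand in for the meta-learner's learned weights."""
    feats = np.stack([grad, np.full_like(theta, loss), theta], axis=1)  # (d, 3)
    i_t = sigmoid(feats @ W_i)   # per-parameter "learning rate" gate
    f_t = sigmoid(feats @ W_f)   # per-parameter "forget" gate
    return f_t * theta + i_t * (-grad)

# One step on a toy quadratic loss L(theta) = ||theta - target||^2.
theta = np.zeros(3)
target = np.array([1.0, -2.0, 0.5])
grad = 2 * (theta - target)
loss = float(((theta - target) ** 2).sum())
rng = np.random.default_rng(0)
W_i, W_f = rng.normal(size=3), rng.normal(size=3)  # untrained gate weights
theta_next = meta_lstm_step(theta, grad, loss, W_i, W_f)
```

In the paper these gate weights are themselves trained across tasks, so that the update rule converges quickly from a learned initialization.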
Few-shot classification accuracies on miniImageNet. (Ravi and Larochelle, 2016).
Model-Agnostic Meta-Learning (MAML) is proposed by Finn et al. in 2017. As the name suggests, the framework is model-agnostic: it is not tied to any specific model architecture. Finn et al. evaluate it on regression, classification, and reinforcement learning problems, and the results are promising.
The objective of MAML is to learn a model that can make rapid progress on new tasks once pre-trained. Two gradient updates are involved in this framework.
They call this update a gradient through a gradient.
The following figure shows the concept of updating gradient by gradient: the dotted lines are the per-task gradients, while the solid line is the final gradient after considering all tasks' gradients.
Optimizes for a representation $\theta$ that can quickly adapt to new tasks (Finn et al., 2017).
First of all, the MAML model is pre-trained using the following procedure. An explanation is provided step by step:
Meta-learning procedure (Finn et al., 2017).
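To make "gradient through a gradient" concrete, here is a minimal sketch using per-task quadratic losses $L_i(\theta) = \lVert\theta - t_i\rVert^2$, chosen purely so that the inner gradient step and the derivative through it have closed forms and no autograd is needed (the toy tasks and step sizes are my own assumptions):

```python
import numpy as np

def maml_quadratic(targets, alpha=0.1, beta=0.05, steps=200):
    """MAML on toy tasks with per-task loss L_i(theta) = ||theta - t_i||^2.

    Inner (dotted-line) step per task:
        theta_i' = theta - alpha * 2(theta - t_i)
    Outer (solid-line) gradient, differentiated THROUGH the inner step:
        d L_i(theta_i') / d theta = (1 - 2*alpha) * 2 (theta_i' - t_i)
    """
    theta = np.zeros_like(targets[0])
    for _ in range(steps):
        meta_grad = np.zeros_like(theta)
        for t in targets:
            theta_inner = theta - alpha * 2 * (theta - t)         # inner update
            meta_grad += (1 - 2 * alpha) * 2 * (theta_inner - t)  # grad through grad
        theta -= beta * meta_grad                                 # outer update
    return theta

tasks = [np.array([1.0, 0.0]), np.array([3.0, 0.0]), np.array([2.0, 3.0])]
theta = maml_quadratic(tasks)
# For these symmetric quadratic tasks the meta-initialization converges
# to the mean of the task optima:
print(np.round(theta, 3))  # → [2. 1.]
```

The factor `(1 - 2 * alpha)` is exactly the Jacobian of the inner update; dropping it would give first-order MAML.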
The fine-tuning stage is quite similar to the meta-learning phase, except that there is no weight initialization (as we already have pre-trained weights).
Fine-tuning procedure (Finn et al., 2017).
Memory-Augmented Neural Networks (MANN) for one-shot learning is proposed by Santoro et al. (2016). They apply MANN to meta-learning so that the model can access external memory (or information) to assist it in predicting the result; this helps with memorizing rare events. Before that, we will take a quick look at the underlying concept and then come back to MANN in meta-learning.
Neural Turing Machines (NTM) is introduced by Graves et al. in 2014. In short, an NTM couples a neural network controller with an external memory matrix, which the controller reads from and writes to through differentiable attention.
To access the external memory, NTM provides two addressing mechanisms: content-based addressing and location-based addressing.
NTM Architecture (Graves et al., 2014).
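Content-based addressing can be sketched in a few lines: the controller emits a key, and the read weights are a softmax over (temperature-scaled) cosine similarities between the key and every memory row (a minimal NumPy illustration; the memory contents and key here are made up):

```python
import numpy as np

def content_addressing(memory, key, beta=1.0):
    """Content-based addressing: softmax over scaled cosine similarity
    between a query key and every memory row, as in NTM reads."""
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    k = key / np.linalg.norm(key)
    sims = m @ k                       # cosine similarity per memory slot
    w = np.exp(beta * sims)            # beta sharpens the focus
    return w / w.sum()                 # attention weights over memory rows

M = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
w = content_addressing(M, key=np.array([1.0, 0.1]), beta=5.0)
read = w @ M                           # soft read: weighted sum of rows
```

Because the weights are a differentiable function of the key, the whole read operation can be trained end to end by backpropagation.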
Santoro et al. follow the setup of NTM with some modifications to the inputs and the addressing methods.
MANN Task Setup (Santoro et al., 2016).
Meta Networks (MetaNet) is introduced by Munkhdalai and Yu (2017). The MetaNet model addresses the lack of ability to learn new task information on the fly. The proposed solution is to generate task-specific fast weights on the fly, on top of task-agnostic slow weights learned across tasks.
The task-specific embeddings are learned by the meta learner, which is a task-agnostic model. The generated embeddings are then passed to the base learner, a task-specific model, to generate the output.
Meta Networks Architecture (Munkhdalai and Yu, 2017).
To train a MetaNet, there are three main procedures: acquiring meta information, generating fast weights, and optimizing slow weights. Here is the pseudocode:
MetaNet training procedure (Munkhdalai and Yu, 2017).
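The fast/slow-weight idea can be sketched very simply: the slow weights are trained across tasks, while the fast weights are generated per task by the meta learner and combined with them before the nonlinearity. The additive combination below is one simple integration scheme for illustration (MetaNet itself uses a learned layer-augmentation to merge the two), and all values are made up:

```python
import numpy as np

def metanet_forward(x, W_slow, W_fast):
    """A layer augmented with task-specific fast weights: slow weights
    are learned across tasks, fast weights are generated on the fly for
    the current task and merged in (here: simply added)."""
    return np.tanh((W_slow + W_fast) @ x)

rng = np.random.default_rng(1)
W_slow = rng.normal(size=(4, 3)) * 0.1   # task-agnostic, updated slowly
W_fast = rng.normal(size=(4, 3)) * 0.1   # produced per task by the meta learner
y = metanet_forward(np.ones(3), W_slow, W_fast)
```

Setting `W_fast` to zeros recovers an ordinary layer, which makes clear that the fast weights act as a per-task correction.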
A comparison of published meta-learning approaches (Metz et al., 2018).
A hundred flowers bloom, and a hundred schools of thought contend: there are lots of different approaches to tackling the meta-learning problem. From my experience, meta-learning is mainly developed in computer vision (CV), while some researchers also apply it in the natural language processing (NLP) field. One possible reason is that transfer learning (e.g., BERT, XLNet) is already very successful in NLP, and the vocabulary variety is too large to adopt meta-learning easily.
Edward Ma. A Gentle Introduction to Meta-Learning.