fani-lab / LADy

LADy 💃: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

2020, ACL, Data Augmentation Using Pre-Trained Transformer Models #79

Open Sepideh-Ahmadian opened 2 months ago

Sepideh-Ahmadian commented 2 months ago

Introduction This paper focuses on using transformer-based models (GPT-2, BERT, BART) for conditional data augmentation. Conditional means that the generator model G receives class information during fine-tuning: class labels are added to the text sequence in one of the two ways explained later.

Main Problem Previously used augmentation methods struggle to preserve the class label and to generalize across model architectures. In addition, the paper aims to provide a unified comparison of pre-trained transformer-based models for data augmentation, which previous studies had only examined separately.

Illustrative Example There is no running example in the paper or its GitHub repository. However, the paper gives an example of label flipping during data augmentation that should be avoided: "a small impact with a big movie" leads to "a small movie with a big impact", which flips the sentiment label.

Input (xi, yi) in D-train, where xi is a sequence of m words and yi is its associated class label.

Output s synthetic examples per (xi, yi), generated by the fine-tuned model (with s = 1, D-augmented is the same size as D-train).
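
A minimal sketch of this input/output contract. The `fine_tune` and `generate` callables are hypothetical stand-ins for whichever conditional model (BERT, GPT-2, BART) is used; they are not from the paper's code.

```python
# Minimal sketch of the conditional DA loop described above.
def augment(d_train, fine_tune, generate, s=1):
    """d_train: list of (x_i, y_i) pairs; returns s synthetic pairs per original."""
    model = fine_tune(d_train)               # G is fine-tuned on label-conditioned sequences
    d_augmented = []
    for x_i, y_i in d_train:
        for _ in range(s):                   # s = 1 => |D_augmented| == |D_train|
            x_synth = generate(model, y_i)   # generation is conditioned on the label y_i
            d_augmented.append((x_synth, y_i))
    return d_augmented
```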

Motivation The use of pre-trained models in the work of Anaby-Tavor et al. (2019) was very promising; however, no similar works provided a comprehensive comparison of pre-trained transformers for data augmentation.

Related works + Their gaps

  1. Word replacement (EDA by Wei and Zou (2019))
  2. Language-model-based replacement (Kobayashi (2018)) => these methods struggle to preserve the label and can confuse the downstream model
  3. Conditional BERT (CBERT) by Wu et al. (2019), based on the masked language modeling task => its weakness is that it cannot be generalized to other pre-trained models
  4. GPT-2 by Anaby-Tavor et al. (2019): data is generated by prompting a fine-tuned model with a class label. The results were promising, but there were no other methods in this category to compare against.

Contribution of this paper

  1. A comparison of three conditional pre-trained transformer models for DA
  2. A guideline for using each of them in different DA scenarios

Proposed Method: Conditional DA in more detail: conditional means that during the fine-tuning phase, model G receives the labels of the data. Wu et al. (2019) proposed CBERT, which uses BERT's segment embeddings to condition the model; however, this trick is restricted to that specific architecture. Therefore, the authors propose two generic conditioning methods (sketched below): 1) Prepend: prepend label yi to each sequence xi without adding yi to the model vocabulary, so the tokenizer may split the label into existing subword tokens; 2) Expand: prepend label yi to each sequence xi and add yi to the model vocabulary, so the label is treated as a single new token.
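
A minimal sketch of the two conditioning schemes, assuming a HuggingFace-style tokenizer; the checkpoint name, label string (a SNIPS-style intent, lowercased for illustration), and example text are illustrative choices, not from the paper.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
label, text = "search_creative_work", "play the latest episode of the office"

# Prepend: the label is NOT added to the vocabulary, so the tokenizer may
# split it into several existing subword tokens.
prepend_ids = tokenizer(f"{label} {text}")["input_ids"]

# Expand: the label IS added to the vocabulary as a new token, so the model
# sees it as a single unit; the embedding matrix must then be resized.
tokenizer.add_tokens([label])
expand_ids = tokenizer(f"{label} {text}")["input_ids"]
# model.resize_token_embeddings(len(tokenizer))  # required after add_tokens, before fine-tuning

print(tokenizer.convert_ids_to_tokens(prepend_ids))
print(tokenizer.convert_ids_to_tokens(expand_ids))
```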

Models:

  1. An autoencoder (AE) language model: BERT, fine-tuned with the masked language modeling (MLM) objective.
  2. An auto-regressive (AR) language model: GPT-2, where the model predicts the next word given the context. The fine-tuning method of Anaby-Tavor et al. (2019) is used; each training sentence is formatted as yi [SEP] xi [EOS]. For data generation, prompts such as yi [SEP] are provided to generate sentences. The drawback of this generation scheme is that it does not preserve label information well; a simple improvement is to provide more context, such as yi [SEP] w1…wk with k = 3 (see the sketch after this list).
  3. A pre-trained seq2seq model: BART (Lewis et al., 2019), which performs better with word-level masking.
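
A minimal sketch of the GPT-2 formatting and prompting described in item 2, using HuggingFace transformers. The separator handling, decoding settings, and example text are assumptions for illustration, and the actual fine-tuning step is omitted.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2 has no separator token by default; register one and resize embeddings.
tokenizer.add_special_tokens({"sep_token": "[SEP]"})
model.resize_token_embeddings(len(tokenizer))
SEP, EOS = tokenizer.sep_token, tokenizer.eos_token

# Fine-tuning format: yi [SEP] xi [EOS]  (fine-tuning itself is omitted here)
y_i, x_i = "positive", "a small movie with a big impact"
train_line = f"{y_i} {SEP} {x_i} {EOS}"

# Generation prompt: yi [SEP] w1 ... wk with k = 3, which keeps more label
# information than prompting with yi [SEP] alone.
k = 3
prompt = f"{y_i} {SEP} " + " ".join(x_i.split()[:k])
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=50,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```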

Experiments

Models:

  1. An autoencoder language model: BERT
  2. An auto-regressive (AR) language model: GPT-2
  3. A pre-trained seq2seq model: BART

NLP tasks:

  1. Sentiment classification
  2. Intent classification
  3. Question classification

Datasets:

  1. SST-2 (Socher et al., 2013): sentiment classification on movie reviews.
  2. SNIPS (Coucke et al., 2018): 7 intent classes from the Snips personal voice assistant.
  3. TREC (Li and Roth, 2002): a question classification dataset.

Baselines:

  1. EDA
  2. Backtranslation
  3. CBERT
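
For context, a minimal sketch of two of EDA's four operations (random swap and random deletion); the original EDA also performs synonym replacement and random insertion via WordNet, which are omitted here.

```python
import random

def random_swap(words, n=1):
    """Randomly swap the positions of two words, n times."""
    words = words[:]
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Delete each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "a small movie with a big impact".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))
```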

Implementation: varunkumar-dev/TransformersDataAugmentation: Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper (github.com)

Gaps of this work They only used the conditional method; what about other methods? They did not show any improvements on resource-rich datasets, only on small ones.

In comparison to backtranslation:

  1. Although backtranslation ensures naturalness, its results heavily depend on the data the translation models were pre-trained on and may carry over those models' biases.
  2. It is not applicable to low-resource languages.
  3. It requires no training time, since off-the-shelf translation models are used.
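
Since LADy itself relies on backtranslation, a minimal sketch of this baseline using off-the-shelf MarianMT models (English to German and back as an illustrative pivot; the model names and decoding settings are assumptions, not from the paper).

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    # Translate a batch of sentences with an off-the-shelf MarianMT checkpoint.
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch, max_new_tokens=64)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in out]

def backtranslate(texts, pivot=("Helsinki-NLP/opus-mt-en-de", "Helsinki-NLP/opus-mt-de-en")):
    # en -> de -> en; no task-specific training is needed, but fluency and bias
    # depend entirely on what the translation models were pre-trained on.
    return translate(translate(texts, pivot[0]), pivot[1])

print(backtranslate(["a small movie with a big impact"]))
```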