fani-lab / LADy

LADy 💃: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

2020, ACL, Data Augmentation Using Pre-Trained Transformer Models #79

Open Sepideh-Ahmadian opened 2 months ago

Sepideh-Ahmadian commented 2 months ago

Introduction This paper focuses on using transformer-based models (GPT-2, BERT, BART) for conditional data augmentation. Conditional means that the generator model G receives class information during fine-tuning: class labels are added to the text sequence in one of the two ways explained later.

Main Problem Previously used augmentation methods struggle to preserve the class label and to generalize across model architectures. In addition, the paper aims to provide a unified comparison of pre-trained transformer-based models for data augmentation, which previous studies had only examined separately.

Illustrative Example There is no running example in the paper or its GitHub repository. However, the paper gives an example of label flipping during data augmentation that should be avoided: "a small impact with a big movie" leads to "a small movie with a big impact", which flips the sentiment label.

Input (xi, yi) in D-train, where xi is a sequence of m words and yi is its associated class label.

Output s synthetic examples per (xi, yi), generated by the fine-tuned model (with s = 1, D-augmented is the same size as D-train).
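
A minimal sketch of this input/output contract. The `fine_tune` and `generate` callables are hypothetical stand-ins for whichever conditional model (BERT, GPT-2, BART) is used; they are not from the paper's code.

```python
# Minimal sketch of the conditional DA loop described above.
def augment(d_train, fine_tune, generate, s=1):
    """d_train: list of (x_i, y_i) pairs; returns s synthetic pairs per original."""
    model = fine_tune(d_train)               # G is fine-tuned on label-conditioned sequences
    d_augmented = []
    for x_i, y_i in d_train:
        for _ in range(s):                   # s = 1 => |D_augmented| == |D_train|
            x_synth = generate(model, y_i)   # generation is conditioned on the label y_i
            d_augmented.append((x_synth, y_i))
    return d_augmented
```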

Motivation The use of pre-trained models in the work of Anaby-Tavor et al. (2019) was very promising; however, no similar works provided a comprehensive comparison of pre-trained transformers for data augmentation.

Related works + Their gaps

  1. Word replacement (EDA by Wei and Zou (2019))
  2. Language-model-based replacement (Kobayashi (2018)) => these methods struggle to preserve the label and can confuse the downstream model
  3. Conditional BERT (CBERT) by Wu et al. (2019), based on the masked language modeling task => its weakness is that it cannot be generalized to other pre-trained models
  4. GPT-2 by Anaby-Tavor et al. (2019): data is generated by prompting a fine-tuned model with a class label. The results were promising, but there were no other methods in this category to compare against.

Contribution of this paper

  1. A comparison of three conditional pre-trained transformer models for DA
  2. A guideline for using each of them in different DA scenarios

Proposed Method: Conditional DA in more detail: conditional means that during the fine-tuning phase, model G receives the labels of the data. Wu et al. (2019) proposed CBERT, which uses BERT's segment embeddings to condition the model; however, this trick is restricted to that specific architecture. Therefore, the authors propose two generic conditioning methods (sketched below): 1) Prepend: prepend label yi to each sequence xi without adding yi to the model vocabulary, so the tokenizer may split the label into existing subword tokens; 2) Expand: prepend label yi to each sequence xi and add yi to the model vocabulary, so the label is treated as a single new token.
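
A minimal sketch of the two conditioning schemes, assuming a HuggingFace-style tokenizer; the checkpoint name, label string (a SNIPS-style intent, lowercased for illustration), and example text are illustrative choices, not from the paper.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
label, text = "search_creative_work", "play the latest episode of the office"

# Prepend: the label is NOT added to the vocabulary, so the tokenizer may
# split it into several existing subword tokens.
prepend_ids = tokenizer(f"{label} {text}")["input_ids"]

# Expand: the label IS added to the vocabulary as a new token, so the model
# sees it as a single unit; the embedding matrix must then be resized.
tokenizer.add_tokens([label])
expand_ids = tokenizer(f"{label} {text}")["input_ids"]
# model.resize_token_embeddings(len(tokenizer))  # required after add_tokens, before fine-tuning

print(tokenizer.convert_ids_to_tokens(prepend_ids))
print(tokenizer.convert_ids_to_tokens(expand_ids))
```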

Models:

  1. An autoencoder (AE) language model: BERT, fine-tuned with the masked language modeling (MLM) objective.
  2. An auto-regressive (AR) language model: GPT-2, where the model predicts the next word given the context. The fine-tuning method of Anaby-Tavor et al. (2019) is used; each training sentence is formatted as yi [SEP] xi [EOS]. For data generation, prompts such as yi [SEP] are provided to generate sentences. The drawback of this generation scheme is that it does not preserve label information well; a simple improvement is to provide more context, such as yi [SEP] w1…wk with k = 3 (see the sketch after this list).
  3. A pre-trained seq2seq model: BART (Lewis et al., 2019), which performs better with word-level masking.
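
A minimal sketch of the GPT-2 formatting and prompting described in item 2, using HuggingFace transformers. The separator handling, decoding settings, and example text are assumptions for illustration, and the actual fine-tuning step is omitted.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2 has no separator token by default; register one and resize embeddings.
tokenizer.add_special_tokens({"sep_token": "[SEP]"})
model.resize_token_embeddings(len(tokenizer))
SEP, EOS = tokenizer.sep_token, tokenizer.eos_token

# Fine-tuning format: yi [SEP] xi [EOS]  (fine-tuning itself is omitted here)
y_i, x_i = "positive", "a small movie with a big impact"
train_line = f"{y_i} {SEP} {x_i} {EOS}"

# Generation prompt: yi [SEP] w1 ... wk with k = 3, which keeps more label
# information than prompting with yi [SEP] alone.
k = 3
prompt = f"{y_i} {SEP} " + " ".join(x_i.split()[:k])
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=50,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```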

Experiments

Models:

  1. An autoencoder language model: BERT
  2. An auto-regressive (AR) language model: GPT-2
  3. A pre-trained seq2seq model: BART

NLP tasks:

  1. Sentiment classification
  2. Intent classification
  3. Question classification

Datasets:

  1. SST-2 (Socher et al., 2013): sentiment classification on movie reviews.
  2. SNIPS (Coucke et al., 2018): 7 intent classes from the Snips personal voice assistant.
  3. TREC (Li and Roth, 2002): a question classification dataset.

Baselines:

  1. EDA
  2. Backtranslation
  3. CBERT
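
For context, a minimal sketch of two of EDA's four operations (random swap and random deletion); the original EDA also performs synonym replacement and random insertion via WordNet, which are omitted here.

```python
import random

def random_swap(words, n=1):
    """Randomly swap the positions of two words, n times."""
    words = words[:]
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Delete each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "a small movie with a big impact".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))
```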

Implementation: varunkumar-dev/TransformersDataAugmentation: Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper (github.com)

Gaps of this work They only used the conditional method; what about other methods? They did not show any improvements on resource-rich datasets, only on small ones.

In comparison to backtranslation:

  1. Although backtranslation ensures naturalness, its results heavily depend on the data the translation models were pre-trained on and may carry over those models' biases.
  2. It is not applicable to low-resource languages.
  3. It requires no training time, since off-the-shelf translation models are used.
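
Since LADy itself relies on backtranslation, a minimal sketch of this baseline using off-the-shelf MarianMT models (English to German and back as an illustrative pivot; the model names and decoding settings are assumptions, not from the paper).

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    # Translate a batch of sentences with an off-the-shelf MarianMT checkpoint.
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch, max_new_tokens=64)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in out]

def backtranslate(texts, pivot=("Helsinki-NLP/opus-mt-en-de", "Helsinki-NLP/opus-mt-de-en")):
    # en -> de -> en; no task-specific training is needed, but fluency and bias
    # depend entirely on what the translation models were pre-trained on.
    return translate(translate(texts, pivot[0]), pivot[1])

print(backtranslate(["a small movie with a big impact"]))
```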