@Sepideh-Ahmadian why "I am doubtful that this paradigmatically related substitution can keep producing proper sentences in every domain"? Like what other domains?
Sure @hosseinfani. Consider analyzing cancer-related data in the medical domain. My concern is the following: take the sentence "The tumor is benign" and its augmented version "The tumor is harmless". The word benign has a specific meaning in this context. Although the word harmless might appear in medical descriptions (such as a CT scan report), if the augmentation suggests harmless as a substitute for benign, it could confuse the downstream classifier, which may fail to pick up the specific terminology of the domain.
Paper Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations
Introduction This article is part of a series of efforts that use language models for data augmentation. The authors assume invariance (i.e., the change remains natural) when a word in a sentence is substituted with other words that are paradigmatically related. The substitute words are suggested by a bi-directional language model at each word position in the sentence.
Main problem It is difficult to establish a universal rule for transforming language in a way that preserves class labels while remaining applicable across various domains (generalization). This work suggests word substitutions based on paradigmatic relations. Previous efforts that only considered synonym substitution using WordNet were limited, since the number of synonyms for each word is small. This research also takes the label into account when determining contextual substitutions. For instance, in the sentence "the actors are fantastic", a substitute for fantastic compatible with a positive label might be funny, while with a negative label it could be dull. The method is evaluated to ensure the label remains valid after substitution.
Illustrative Example In this example, only substitution of the word actors is considered. Original review: The actors are fantastic. Augmented sentences: The performances (films, movies, stories) are fantastic.
Input A sentence, i.e., a sequence of words (e.g., The actors are fantastic)
Output K sentences built from the model's high-probability substitutions (e.g., The characters are funny for a positive label; The characters are tired for a negative label)
Motivation In previous works, the word actor in "The actors are fantastic" could be replaced, using the word's synsets in WordNet, by players or historian, based on the average similarity of the words. However, actor can also be replaced with non-synonym words such as characters, movies, or stories in a way that keeps the sentiment, naturalness, and context.
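For concreteness, here is a minimal sketch of the WordNet-style synonym replacement described above, assuming NLTK with the WordNet corpus downloaded; the helper name wordnet_substitutes is mine, not from the paper or its code.

```python
# Minimal sketch of the WordNet baseline: collect lemmas from a word's synsets
# as substitution candidates (requires: pip install nltk; nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def wordnet_substitutes(word, max_candidates=5):
    """Return lemma names from all synsets of `word`, excluding the word itself."""
    candidates = []
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower() and name not in candidates:
                candidates.append(name)
    return candidates[:max_candidates]

print(wordnet_substitutes("actor"))
# e.g. ['histrion', 'player', 'thespian', 'role player', 'doer'] -- strict synonyms only;
# the contextual method can additionally propose non-synonyms such as "characters".
```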
Related works and their gaps The following methods have been used for data augmentation:
Contribution of this paper They proposed context-based substitute generation to overcome the drawbacks of plain synonym replacement, and they added a label parameter to the conditional probability to avoid label flipping.
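In formula terms, as I understand the paper's setup, the base method samples a substitute for position i from the surrounding context alone, while the label-conditional variant also conditions on the class label y:

```latex
% base contextual substitution at position i of sentence S
w_i' \sim p\left(\cdot \mid S \setminus \{w_i\}\right)
% label-conditional variant, keeping suggestions compatible with label y
w_i' \sim p\left(\cdot \mid y,\, S \setminus \{w_i\}\right)
```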
Proposed Method They proposed an LM that calculates the probability of a word at a specific position based on its context, where the context is the sequence of words surrounding that position. They used a bidirectional LSTM-RNN to encode the context and to choose relevant words from the vocabulary. There is also a risk of class-label flipping: since substitutions are suggested for every word in a sentence, the class label may change. For instance, the sentence "all actors are fantastic" could be altered to "no actors are fantastic", which would change the meaning and, consequently, the class label. To prevent this, they used a label-conditioning technique, feeding the class label into the LM so that the suggested words stay consistent with it.
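Below is a minimal sketch of the contextual-substitution mechanism, assuming the HuggingFace transformers library and a pretrained BERT masked LM. The paper itself trains a label-conditional bidirectional LSTM LM; this unconditional stand-in only illustrates how a bidirectional model proposes paradigmatically related words for a masked position, and, lacking the label conditioning, it can still flip the sentiment.

```python
# Sketch: propose top-k in-context substitutes for one word using a masked LM.
# Assumes `transformers` (and a backend such as PyTorch) is installed; the model
# name "bert-base-uncased" is just a convenient public checkpoint, not the paper's LM.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The actors are fantastic."
target = "actors"

# Mask the target position and ask the bidirectional LM for candidates.
masked = sentence.replace(target, fill_mask.tokenizer.mask_token, 1)
for candidate in fill_mask(masked, top_k=5):
    print(f"{candidate['token_str']:>12}  p={candidate['score']:.3f}  {candidate['sequence']}")
# The top candidates are words that fit the slot in context (paradigmatically
# related, but not necessarily synonyms of "actors").
```

A label-conditional version would additionally feed the class label into the model (e.g., as an extra embedding combined with the context representation) so that a positive review only receives positive-compatible substitutes.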
Experiments Datasets:
Models:
Implementation https://github.com/pfnet-research/
Gaps of the work: This work may face limitations with low-resource languages and with biases inherited from pretrained models. Additionally, it may have a hard time dealing with complex sentences, since substituted words may change the structure dramatically. I am doubtful that this paradigmatically related substitution can keep producing proper sentences in every domain.