@Sepideh-Ahmadian why "I am doubtful that this paradigmatically related substitution can keep producing proper sentences in every domain"? Like what other domains?
Sure @hosseinfani. Consider analyzing cancer-related data in the medical domain. My concern is the following: take the sentence "The tumor is benign" and its augmented version "The tumor is harmless". The word benign has a specific meaning in this context. Although the word harmless might appear in medical descriptions (such as a CT scan report), if the augmentation suggests harmless as a substitute for benign, it could confuse the downstream classifier, which may fail to pick up the specific terminology of the domain.
Paper Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations
Introduction This article is part of a series of efforts that use language models for data augmentation. The authors assume invariance (i.e., the change remains natural) when a word in a sentence is substituted with other words that are paradigmatically related. The substitute words are suggested by a bi-directional language model at each word position in the sentence.
Main problem It is difficult to establish a universal rule for transforming language in a way that preserves class labels while remaining applicable across various domains (generalization). This work suggests word substitutions based on paradigmatic relations. Previous efforts that only considered synonym substitution using WordNet were limited, since the number of synonyms for each word is small. This research also takes the label into account when determining contextual substitutions. For instance, in the sentence "the actors are fantastic", a substitute for fantastic compatible with a positive label might be funny, while with a negative label it could be dull. The method is evaluated to ensure the label remains valid after substitution.
Illustrative Example In this example, only substitution of the word actors is considered. Original review: The actors are fantastic. Augmented sentences: The performances (films, movies, stories) are fantastic.
Input A sentence, i.e., a sequence of words (e.g., The actors are fantastic)
Output K sentences built from the model's high-probability substitutions (e.g., The characters are funny for a positive label; The characters are tired for a negative label)
Motivation In previous works, the word actor in "The actors are fantastic" could be replaced, using the word's synsets in WordNet, by players or historian, based on the average similarity of the words. However, actor can also be replaced with non-synonym words such as characters, movies, or stories in a way that keeps the sentiment, naturalness, and context.
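For concreteness, here is a minimal sketch of the WordNet-style synonym replacement described above, assuming NLTK with the WordNet corpus downloaded; the helper name wordnet_substitutes is mine, not from the paper or its code.

```python
# Minimal sketch of the WordNet baseline: collect lemmas from a word's synsets
# as substitution candidates (requires: pip install nltk; nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def wordnet_substitutes(word, max_candidates=5):
    """Return lemma names from all synsets of `word`, excluding the word itself."""
    candidates = []
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower() and name not in candidates:
                candidates.append(name)
    return candidates[:max_candidates]

print(wordnet_substitutes("actor"))
# e.g. ['histrion', 'player', 'thespian', 'role player', 'doer'] -- strict synonyms only;
# the contextual method can additionally propose non-synonyms such as "characters".
```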
Related works and their gaps The following methods have been used for data augmentation:
Contribution of this paper They proposed context-based substitute generation to overcome the drawbacks of plain synonym replacement, and they added a label parameter to the conditional probability to avoid label flipping.
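In formula terms, as I understand the paper's setup, the base method samples a substitute for position i from the surrounding context alone, while the label-conditional variant also conditions on the class label y:

```latex
% base contextual substitution at position i of sentence S
w_i' \sim p\left(\cdot \mid S \setminus \{w_i\}\right)
% label-conditional variant, keeping suggestions compatible with label y
w_i' \sim p\left(\cdot \mid y,\, S \setminus \{w_i\}\right)
```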
Proposed Method They proposed an LM that calculates the probability of a word at a specific position based on its context, where the context is the sequence of words surrounding that position. They used a bidirectional LSTM-RNN to encode the context and to choose relevant words from the vocabulary. There is also a risk of class-label flipping: since substitutions are suggested for every word in a sentence, the class label may change. For instance, the sentence "all actors are fantastic" could be altered to "no actors are fantastic", which would change the meaning and, consequently, the class label. To prevent this, they used a label-conditioning technique, feeding the class label into the LM so that the suggested words stay consistent with it.
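Below is a minimal sketch of the contextual-substitution mechanism, assuming the HuggingFace transformers library and a pretrained BERT masked LM. The paper itself trains a label-conditional bidirectional LSTM LM; this unconditional stand-in only illustrates how a bidirectional model proposes paradigmatically related words for a masked position, and, lacking the label conditioning, it can still flip the sentiment.

```python
# Sketch: propose top-k in-context substitutes for one word using a masked LM.
# Assumes `transformers` (and a backend such as PyTorch) is installed; the model
# name "bert-base-uncased" is just a convenient public checkpoint, not the paper's LM.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The actors are fantastic."
target = "actors"

# Mask the target position and ask the bidirectional LM for candidates.
masked = sentence.replace(target, fill_mask.tokenizer.mask_token, 1)
for candidate in fill_mask(masked, top_k=5):
    print(f"{candidate['token_str']:>12}  p={candidate['score']:.3f}  {candidate['sequence']}")
# The top candidates are words that fit the slot in context (paradigmatically
# related, but not necessarily synonyms of "actors").
```

A label-conditional version would additionally feed the class label into the model (e.g., as an extra embedding combined with the context representation) so that a positive review only receives positive-compatible substitutes.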
Experiments Datasets:
Models:
Implementation https://github.com/pfnet-research/
Gaps of the work: This work may face limitations with low-resource languages and with biases inherited from pretrained models. Additionally, it may have a hard time dealing with complex sentences, since substituted words may change the structure dramatically. I am doubtful that this paradigmatically related substitution can keep producing proper sentences in every domain.