amitness / blog-comments

Repo to host the comments with https://utteranc.es/
https://amitness.com/

2020/05/data-augmentation-for-nlp/ #3

Closed utterances-bot closed 7 months ago

utterances-bot commented 4 years ago

A Visual Survey of Data Augmentation in NLP

An extensive overview of text data augmentation techniques for Natural Language Processing

https://amitness.com/2020/05/data-augmentation-for-nlp/

kurianbenoy commented 4 years ago

Thanks for this amazing article, bro! I was just wondering how I could do augmentation with text.

KrithikaJayaraman commented 4 years ago

Excellent article. Keep going!

amitaug1984 commented 4 years ago

Awesome work! Summarized:

1. Lexical Substitution
2. Back Translation: English to another language, then back to English
3. Text Surface Transformation: transforming through contraction and expansion
4. Random Noise Injection
5. Instance Cross Augmentation: tweets with the same polarity have their halves swapped
6. Syntax-tree Manipulation: e.g., active voice to passive voice
7. MixUp for Text
8. Generative Methods: generates additional training data
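Technique 4 above (random noise injection) is simple enough to sketch without any libraries. Below is a minimal, dependency-free illustration of two such operations, random swap and random deletion, in the spirit of the Easy Data Augmentation (EDA) paper; the function names and parameters are illustrative, not taken from the article.

```python
import random

def random_swap(words, n=1):
    """Randomly swap the positions of two words, n times (EDA-style)."""
    words = words.copy()
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Delete each word independently with probability p; keep at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

random.seed(0)
sentence = "data augmentation helps small nlp datasets".split()
print(" ".join(random_swap(sentence, n=2)))
print(" ".join(random_deletion(sentence, p=0.2)))
```

Each call produces a slightly perturbed copy of the input, so one labeled sentence can be turned into several noisy training examples while (hopefully) preserving its label.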

sids07 commented 4 years ago

Wonderful article, pretty informative.

NLP-cr commented 4 years ago

Thanks for the wonderful review. Please note that the Generative Methods technique you presented (8) was first proposed in the paper Not Enough Data? Deep Learning to the Rescue! (https://arxiv.org/abs/1911.03118). I think I saw it at the AAAI-20 conference.

amitness commented 4 years ago

@NLP-cr Thank you for pointing that out. I've reviewed the link you shared and have corrected the relevant section.

puzzler10 commented 4 years ago

Nice list! One more to add. I've seen text adversarial examples being used as data augmentation with some success (e.g. https://www.aclweb.org/anthology/N18-1089/), although this works best for small datasets, and may reduce accuracy for larger ones (https://arxiv.org/abs/1805.12152)

bpw1621 commented 4 years ago

This was a fantastic read on a topic I had not seen a great literature review on before. Thanks a lot for taking the time to be as comprehensive as this seems to be!

ticiana commented 4 years ago

Very clear tutorial! Thanks for your great work!

sbmaruf commented 4 years ago

Great review. A new paper for Generative Methods: https://arxiv.org/abs/2004.13240

wonyeongdeok commented 4 years ago

Thank you for your great work! I have a question: can your findings be used in other languages, excluding back translation?

amitness commented 4 years ago

> Thank you for your great work! I have a question: can your findings be used in other languages, excluding back translation?

Some of them are applicable to other languages as well:

wonyeongdeok commented 4 years ago

@amitness I am amazed by your rich knowledge. This will be very helpful for my project. Thank you very much!

aswin-giridhar commented 3 years ago

Thanks a lot, the article was very informative.

yananchen1989 commented 3 years ago

It seems that these DA methods are only effective in a low-data regime. I tried these methods on text classification where I sampled only 32 instances from each class, and they worked. However, if I enlarge the training set to, for example, 1000 samples per class, the DA does not help at all in terms of accuracy. Is there any study or paper on this problem?

amitness commented 3 years ago

@yananchen1989 Yes, your observation is correct.

A similar result was also shown in the Easy Data Augmentation paper. See the section "4.2 Training Set Sizing". The paper also has other ablation studies.

lethaiq commented 3 years ago

@amitness, there is a 2017 paper that uses a VAE to generate synthetic examples, which significantly improves the performance of clickbait detectors. It was published before the recent efforts using generative models such as GPT-2. https://ieeexplore.ieee.org/abstract/document/9073621

Eunhui-Kim commented 2 years ago

Thank you so much. It's so helpful for getting an overview of this area.

FeiyanLiu commented 2 years ago

Thank you so much.