Thanks for this amazing article, bro! I was just wondering: how can I do augmentation with text?
Excellent article. Keep going!
Awesome work! Here's a summary of the techniques:
1. Lexical Substitution: replacing words with synonyms
2. Back Translation: translating English to another language, then back to English
3. Text Surface Transformation: transforming through contraction and expansion
4. Random Noise Injection: injecting noise such as spelling mistakes or random words
5. Instance Crossover Augmentation: tweets with the same polarity have their halves swapped
6. Syntax-tree Manipulation: e.g., active voice to passive voice
7. MixUp for Text: interpolating the embeddings and labels of two texts
8. Generative Methods: generating additional training data with language models
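To make the first technique in the summary above concrete, here is a minimal sketch of lexical substitution. It assumes a tiny hand-built synonym map (`SYNONYMS` is just an illustration; real pipelines typically use WordNet or word embeddings to find substitutes):

```python
import random

# Toy synonym map for illustration; in practice this would come from
# a thesaurus like WordNet or from nearest neighbors in embedding space.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
    "movie": ["film"],
}

def lexical_substitution(sentence, p=0.5, seed=0):
    """Replace each word that has known synonyms with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)

print(lexical_substitution("a quick happy movie review", p=1.0))
```

Each call with a different seed yields a different augmented variant of the same sentence, which is the point of the technique.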
Wonderful article pretty informative
Thanks for the wonderful review. Please note that the Generative Methods technique you presented (8) was first proposed in the paper Not Enough Data? Deep Learning to the Rescue! (https://arxiv.org/abs/1911.03118). I think I saw it at the AAAI-20 conference.
@NLP-cr Thank you for pointing that out. I've reviewed the link you shared and have corrected the relevant section.
Nice list! One more to add. I've seen text adversarial examples being used as data augmentation with some success (e.g. https://www.aclweb.org/anthology/N18-1089/), although this works best for small datasets, and may reduce accuracy for larger ones (https://arxiv.org/abs/1805.12152)
This was a fantastic read on a topic I have not seen great literature review on before. Thanks a lot for taking the time to be as comprehensive as this seems to be!
Very clear tutorial! Thanks for your great work!
Great review. A new paper for Generative Methods, https://arxiv.org/abs/2004.13240
Thank you for your great work! I have a question: can your findings be used in other languages, excluding 'Back translation'?
Some of them are applicable to other languages as well:
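For instance, noise-based techniques need only tokenization, so they carry over to many languages with whitespace-delimited words. A minimal sketch of random swap from the EDA paper (the helper name `random_swap` is my own choice, not from the article):

```python
import random

def random_swap(sentence, n_swaps=1, seed=0):
    """Return a copy of the sentence with n_swaps random word pairs swapped.

    Language-agnostic as long as words are whitespace-delimited;
    languages without spaces (e.g. Chinese) need a tokenizer first.
    """
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)
```

Since the swap only shuffles word positions, the augmented sentence always contains exactly the original words.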
@amitness I am amazed by your rich knowledge. Your answer will be very helpful to my project. Thank you very much!
Thanks a lot, the article was very informative
It seems that these DA methods are only effective in a low-data regime. I tried these methods on text classification where I sampled only 32 instances from each class, and they worked. However, if I enlarge the training set, for example to 1000 samples per class, the DA does not help at all in terms of accuracy. Is there any study or paper on this problem?
@yananchen1989 Yes, your observation is correct.
A similar result was also shown in the Easy Data Augmentation paper. See the section "4.2 Training Set Sizing". The paper also has other ablation studies.
@amitness, in 2017 there was a paper that used a VAE to generate synthetic examples that significantly improved the performance of clickbait detectors. This was published before recent efforts using generative models such as GPT-2. https://ieeexplore.ieee.org/abstract/document/9073621
Thank you so much. It's so helpful for getting an overview of this area.
Thank you so much.
A Visual Survey of Data Augmentation in NLP
An extensive overview of text data augmentation techniques for Natural Language Processing
https://amitness.com/2020/05/data-augmentation-for-nlp/