Labelled text data is hard to come by, and labelling data manually is costly; but usually, the more data we have, the better performance we can achieve. While working on text normalization, we can also consider text augmentation.
Adding augmented text data might boost our model's performance by increasing the number of instances to train on. For this we can try several approaches, from simple to more complex ones:
Random Removal:
In this method we randomly select a given percentage of the words in a document and delete them.
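A minimal sketch of this idea, assuming simple whitespace tokenization; the deletion probability `p` is an illustrative parameter, not a fixed recommendation:

```python
import random

def random_removal(text, p=0.1):
    """Delete roughly a fraction p of the words in a document at random."""
    words = text.split()
    if not words:
        return text
    # Keep each word independently with probability (1 - p).
    kept = [w for w in words if random.random() > p]
    # Guard against deleting everything: keep at least one word.
    return " ".join(kept) if kept else random.choice(words)

print(random_removal("the quick brown fox jumps over the lazy dog", p=0.2))
```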
Synonym Replacement:
In this method we can augment the data by randomly replacing words with their synonyms.
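One possible sketch using WordNet from NLTK as the synonym source (it requires a one-time `nltk.download("wordnet")`); the number of replacements `n` is an illustrative parameter:

```python
import random
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def synonym_replacement(text, n=2):
    """Replace up to n randomly chosen words with a WordNet synonym."""
    words = text.split()
    indices = list(range(len(words)))
    random.shuffle(indices)
    replaced = 0
    for i in indices:
        # Collect all distinct lemma names across the word's synsets.
        synonyms = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(words[i])
            for lemma in syn.lemmas()
            if lemma.name().lower() != words[i].lower()
        }
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
            replaced += 1
        if replaced >= n:
            break
    return " ".join(words)

print(synonym_replacement("the movie was a great success"))
```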
Embedding Replacement:
In this method we can randomly replace words with their most similar ones, found via k-nearest neighbours under cosine similarity in an embedding space. Pre-trained embeddings such as word2vec, GloVe, or fastText can supply the similar words.
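A rough sketch with pre-trained GloVe vectors loaded through gensim's downloader; the model name, replacement probability `p`, and neighbour count `topn` are all example choices (and the first `api.load` call downloads the vectors):

```python
import random
import gensim.downloader as api

# Pre-trained 100-dimensional GloVe vectors (one-time download).
vectors = api.load("glove-wiki-gigaword-100")

def embedding_replacement(text, p=0.2, topn=5):
    """Replace in-vocabulary words with one of their topn nearest
    neighbours by cosine similarity, with probability p per word."""
    words = text.split()
    for i, word in enumerate(words):
        if word in vectors and random.random() < p:
            neighbours = vectors.most_similar(word, topn=topn)
            words[i] = random.choice(neighbours)[0]
    return " ".join(words)

print(embedding_replacement("the movie was a great success"))
```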
Character Replacement:
We can inject noise that mimics common typos on QWERTY keyboards by randomly replacing some characters with their nearest keyboard neighbours.
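A small sketch of this noising step; the neighbour map below covers only a few keys as an example and would need to be extended for full coverage:

```python
import random

# Partial QWERTY adjacency map (example only; extend for all keys).
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "e": "wsdr", "i": "ujko",
    "o": "iklp", "n": "bhjm", "t": "rfgy", "r": "edft",
}

def typo_noise(text, p=0.05):
    """Swap each mapped character for a neighbouring key with probability p."""
    out = []
    for ch in text:
        if ch.lower() in QWERTY_NEIGHBOURS and random.random() < p:
            out.append(random.choice(QWERTY_NEIGHBOURS[ch.lower()]))
        else:
            out.append(ch)
    return "".join(out)

print(typo_noise("this sentence will pick up some typos", p=0.1))
```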
Back Translation:
In this method we can translate a document into another language using a more complex model (such as a transformer) and then translate it back into the original language. The round trip gives us a paraphrased, augmented version of the document.
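A sketch using the Hugging Face transformers library with MarianMT translation models; the English-to-French pivot and the specific Helsinki-NLP model names are example choices, and both models are downloaded on first use:

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    """Translate a batch of texts with a pre-trained MarianMT model."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

def back_translate(texts):
    # English -> French -> English round trip for paraphrasing.
    french = translate(texts, "Helsinki-NLP/opus-mt-en-fr")
    return translate(french, "Helsinki-NLP/opus-mt-fr-en")

print(back_translate(["The weather is really nice today."]))
```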
Text Generation:
We can generate synthetic data using text generators like GPT-2. This can give us more data to train with.
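A sketch using the transformers text-generation pipeline with GPT-2; the prompt here is a hypothetical class-typical opening, and in practice the generated text would still need labelling or filtering before training:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Sample several continuations of a prompt and keep them as extra instances.
samples = generator(
    "The customer service was",
    max_new_tokens=30,
    num_return_sequences=3,
    do_sample=True,
)
for s in samples:
    print(s["generated_text"])
```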