Colin-Codes / IntentClassifier-ML-Project

Python, Keras, SciKit-Learn, Matplotlib: Machine learning research project on classifying the intent behind tech support emails in order to enable automatic follow-up.

Apply Data augmentation #30

Closed Colin-Codes closed 4 years ago

Colin-Codes commented 4 years ago

https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
https://www.researchgate.net/post/Is_there_any_data_augmentation_technique_for_text_data_set
https://towardsdatascience.com/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610

Colin-Codes commented 4 years ago

This may be helpful if it ends up padding out the dataset somewhat too.

Colin-Codes commented 4 years ago
  1. Data augmentation: I'm looking into 'Easy Data Augmentation' (EDA), which pads the dataset with new training examples that are variations of the originals, typically produced by synonym replacement or random word deletion. This should make the models less brittle to changes in wording, and I expect it to help my KNN model in particular. I'm confident that I can implement this.
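
The two operations mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from the EDA repository: the `SYNONYMS` table, the function names, and the default parameters are all assumptions made for the example, and EDA itself draws synonyms from WordNet rather than a hand-written dictionary.

```python
import random

# Hypothetical toy synonym table (an assumption for this sketch);
# EDA looks synonyms up in WordNet instead.
SYNONYMS = {
    "error": ["fault", "failure"],
    "restart": ["reboot"],
    "broken": ["faulty"],
}


def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i].lower()])
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment(text, n_variants=4, n_replace=1, p_delete=0.1, seed=0):
    """Return n_variants variations of text, alternating the two operations."""
    rng = random.Random(seed)
    words = text.split()
    variants = []
    for k in range(n_variants):
        if k % 2 == 0:
            new_words = synonym_replacement(words, n_replace, rng)
        else:
            new_words = random_deletion(words, p_delete, rng)
        variants.append(" ".join(new_words))
    return variants
```

Calling `augment("please restart the broken server")` returns four variants of the email text. Each variant keeps the label of the original example.
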
Colin-Codes commented 4 years ago

https://arxiv.org/pdf/1901.11196.pdf

Colin-Codes commented 4 years ago

https://github.com/jasonwei20/eda_nlp

Colin-Codes commented 4 years ago

- Ignore title line
- Create more examples for smaller classes e.g. largest class has n = n_aug, smaller classes
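
One possible reading of the note above is to top up each smaller class with augmented copies until it matches the largest class. The sketch below assumes that reading, since the note is cut off. `balance_with_augmentation` and `drop_one_word` are hypothetical names, and the placeholder augmenter stands in for whatever EDA operation ends up being used.

```python
import random
from collections import defaultdict


def balance_with_augmentation(examples, augment_fn, seed=0):
    """Return the original (text, label) pairs plus augmented copies so that
    every class reaches the size of the largest class (assumed target)."""
    if not examples:
        return []
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)
    target = max(len(texts) for texts in by_label.values())
    balanced = list(examples)
    for label, texts in by_label.items():
        # The largest class needs zero extra examples, so it is left untouched.
        for _ in range(target - len(texts)):
            source = rng.choice(texts)
            balanced.append((augment_fn(source, rng), label))
    return balanced


def drop_one_word(text, rng):
    """Placeholder augmenter: delete one random word from multi-word texts."""
    words = text.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
    return " ".join(words)
```

To keep the later evaluation honest, this would be applied to the training split only, so the test set contains only original emails.
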

Colin-Codes commented 4 years ago

Evaluate if this improves results

Colin-Codes commented 4 years ago

https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html