EPFLMachineLearningTeamYoor / Project02

Project 2: text classification
MIT License

Research into state-of-the-art sentiment analysis #2

Closed sergeivolodin closed 6 years ago

sergeivolodin commented 6 years ago

Examples:

  1. http://deeplearning.net/tutorial/lstm.html
  2. http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

We need more good ideas.

sergeivolodin commented 6 years ago

GloVe (baseline) #1: https://nlp.stanford.edu/projects/glove/

CNN #4: https://medium.com/@thoszymkowiak/how-to-implement-sentiment-analysis-using-word-embedding-and-convolutional-neural-networks-on-keras-163197aef623
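A minimal sketch of the GloVe baseline idea: represent a tweet as the average of its words' pretrained vectors, then feed that vector to any classifier. The inline sample below stands in for a real GloVe file (which uses the same `word v1 v2 ...` text format); all names are illustrative, not project code.

```python
GLOVE_SAMPLE = """\
good 0.5 0.1
bad -0.4 0.2
movie 0.0 0.3
"""

def load_glove(lines):
    """Parse the GloVe text format ("word v1 v2 ...") into {word: vector}."""
    vecs = {}
    for line in lines.strip().splitlines():
        word, *nums = line.split()
        vecs[word] = [float(x) for x in nums]
    return vecs

def tweet_vector(tokens, vecs, dim=2):
    """Average the embeddings of the known words (zeros if none are known)."""
    known = [vecs[t] for t in tokens if t in vecs]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

vecs = load_glove(GLOVE_SAMPLE)
print(tweet_vector(["good", "movie"], vecs))  # mean of the two vectors
```

The resulting fixed-length vectors can then be fed to a logistic regression or SVM as the baseline classifier.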

okm02 commented 6 years ago

Some comments on text cleaning and feature engineering; the text also has two links explaining word embeddings and a CNN tutorial.

  * lexical diversity (pg 18)
  * bigram: pairs of words that might occur together; collocations gets us the most frequent bigrams
  * PlaintextCorpusReader
  * ConditionalFreqDist() with (positive/negative, word) pairs (pg 52)
  * tabulation of results (pg 54)
  * word stems (pg 104)
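The bigram/collocation idea in the notes above can be sketched without any NLP library: count adjacent word pairs and keep the most frequent ones (a crude collocation measure; `top_bigrams` is an illustrative name, not an NLTK function).

```python
from collections import Counter

def top_bigrams(tokens, n=3):
    """Most frequent adjacent word pairs (a crude collocation measure)."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)

tokens = "new york is in new york state".split()
print(top_bigrams(tokens, 1))  # [(('new', 'york'), 2)]
```

NLTK's `ConditionalFreqDist` generalizes this: conditioning on the (positive/negative) label gives per-class word frequencies that can be tabulated as in the notes.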

Two main models:

1) Frequency based: to improve the model, add features for unigrams/bigrams/parts of speech. Cleaning steps:

  * convert to lowercase
  * remove numbers
  * remove stopwords (and, but, I)
  * remove punctuation
  * remove whitespace left over from the previous steps
  * tokenize

In R (see the equivalent in Python) -> DocumentTermMatrix: rows are message ids, columns are words (cells hold the number of times the word occurred in that message). Eliminate any word that appears in fewer than about 0.1 percent of records in the training data.

To train, use an SVM (enrich the model via the kernel trick), Naive Bayes, or maximum entropy.
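The cleaning and document-term-matrix steps above can be sketched in plain Python (the matrix here is a list of per-message count dicts; the stopword list and function names are illustrative assumptions, not project code):

```python
import re
import string
from collections import Counter

STOPWORDS = {"and", "but", "i"}  # tiny illustrative list

def clean_tokens(text):
    """Lowercase, drop numbers/punctuation/stopwords, then tokenize."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t not in STOPWORDS]

def document_term_matrix(messages, min_doc_frac=0.001):
    """Rows = message ids, columns = words; prune words appearing in
    fewer than min_doc_frac of the messages."""
    docs = [Counter(clean_tokens(m)) for m in messages]
    doc_freq = Counter(w for d in docs for w in d)
    keep = {w for w, c in doc_freq.items() if c / len(messages) >= min_doc_frac}
    return [{w: c for w, c in d.items() if w in keep} for d in docs]

dtm = document_term_matrix(["Great movie!!", "great acting, bad plot 123"])
print(dtm)
```

In practice scikit-learn's `CountVectorizer` (its `min_df` parameter implements the same pruning) plus `MultinomialNB` or `LinearSVC` covers this whole pipeline.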

Extra NLP normalization: lowercase -> strip affixes -> check whether the results are in a dictionary; stemming -> lemmatization.
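The lowercase -> strip affixes -> dictionary-check pipeline can be sketched as a crude suffix stripper (the suffix list and `crude_stem` name are illustrative; a real pipeline would use NLTK's `PorterStemmer` or `WordNetLemmatizer`):

```python
SUFFIXES = ("ing", "ly", "ed", "s")  # crude affix list for illustration

def crude_stem(word, dictionary):
    """Lowercase, then strip a known suffix if the remainder
    is a dictionary word; otherwise return the word unchanged."""
    word = word.lower()
    if word in dictionary:
        return word
    for suf in SUFFIXES:
        if word.endswith(suf) and word[: -len(suf)] in dictionary:
            return word[: -len(suf)]
    return word

vocab = {"walk", "happy", "run"}
print(crude_stem("Walking", vocab))  # -> "walk"
```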

(How to handle hashtags/exclamation marks?)

2) Word embedding based: https://medium.com/@thoszymkowiak/how-to-implement-sentiment-analysis-using-word-embedding-and-convolutional-neural-networks-on-keras-163197aef623

https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/

sergeivolodin commented 6 years ago

We can use the edit distance between words to account for word augmentation by Twitter users.

Example: https://en.m.wikipedia.org/wiki/Levenshtein_distance

sergeivolodin commented 6 years ago

https://github.com/facebookresearch/fastText

sergeivolodin commented 6 years ago

word count model: counting occurrences of words in tweets
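The word count model reduces to a single counter over the tokenized tweets (a minimal sketch with illustrative data, whitespace tokenization assumed):

```python
from collections import Counter

def word_counts(tweets):
    """Count occurrences of each word across a list of tweets."""
    return Counter(w for t in tweets for w in t.lower().split())

counts = word_counts(["good good vibes", "Bad vibes"])
print(counts["good"], counts["vibes"])  # 2 2
```

Per-class versions of these counts (one counter for positive tweets, one for negative) give the frequency features the earlier comments describe.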