Closed: sergeivolodin closed this issue 6 years ago
Some comments on text cleaning and feature engineering; the text also has two links explaining word embeddings and a tutorial on CNNs.
Reading notes (page references):
- lexical diversity (pg 18)
- bigrams: pairs of words that might occur together; collocations gets us the most frequent bigrams
- PlaintextCorpusReader
- ConditionalFreqDist() over (positive/negative, word) pairs (pg 52)
- tabulation of results (pg 54)
- word stems (pg 104)
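The NLTK pieces named in the notes above (lexical diversity, frequent bigrams, a conditional frequency distribution over (label, word) pairs) can be sketched with the standard library alone; the sample words and labels are made up for illustration, and in NLTK these correspond to `FreqDist(bigrams(words))` and `ConditionalFreqDist`.

```python
from collections import Counter, defaultdict

words = "the cat sat on the mat the cat slept".split()

# Lexical diversity (pg 18): unique words / total words.
diversity = len(set(words)) / len(words)

# Bigrams: pairs of adjacent words; the most frequent ones are what
# nltk.FreqDist(nltk.bigrams(words)) would report.
bigram_counts = Counter(zip(words, words[1:]))
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]

# Conditional frequencies over (positive/negative, word) pairs, like
# nltk.ConditionalFreqDist (pg 52).
pairs = [("positive", "good"), ("positive", "good"), ("negative", "bad")]
cfd = defaultdict(Counter)
for label, word in pairs:
    cfd[label][word] += 1
print(cfd["positive"]["good"])  # 2
```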
Two main models:

1) Frequency based. To improve the model, add features for unigrams/bigrams/parts of speech. Cleaning steps:
- convert to lowercase
- remove numbers
- remove stopwords (and, but, I, ...)
- remove punctuation
- remove whitespace resulting from the previous actions
- tokenization

In R (see the equivalent in Python): build a DocumentTermMatrix whose rows are message IDs and whose columns are words (cells hold the number of times the word occurred in that message). Eliminate any words that appear in less than about 0.1 percent of records in the training data.
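The cleaning steps and the document-term matrix above can be sketched in pure Python; the stopword list and sample messages are toy values. In practice scikit-learn's `CountVectorizer` (with `lowercase=True` and a `min_df` threshold for the 0.1-percent cutoff) covers most of this in one call.

```python
import re
import string
from collections import Counter

STOPWORDS = {"and", "but", "i", "the", "a"}  # toy stopword list

def clean(text):
    """Lowercase, drop numbers/punctuation/stopwords, tokenize."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)                            # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                                      # tokenize; split() also eats extra whitespace
    return [t for t in tokens if t not in STOPWORDS]

docs = ["I won 100 dollars!", "But the cat won."]
tokenized = [clean(d) for d in docs]

# Document-term matrix: rows = messages, columns = vocabulary words.
vocab = sorted({w for doc in tokenized for w in doc})
dtm = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
print(vocab)  # ['cat', 'dollars', 'won']
print(dtm)    # [[0, 1, 1], [1, 0, 1]]
```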
To train, use an SVM (enrich the model via the kernel trick), Naive Bayes, or maximum entropy.
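As a sketch of the Naive Bayes option, here is a tiny multinomial Naive Bayes over word counts with Laplace smoothing; the training documents are made up. In practice scikit-learn's `MultinomialNB`, `SVC` (kernel trick), or `LogisticRegression` (maximum entropy) would be used instead.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, tokens). Returns word counts, token totals, and priors per label."""
    counts, totals, priors = defaultdict(Counter), Counter(), Counter()
    for label, tokens in docs:
        priors[label] += 1
        counts[label].update(tokens)
        totals[label] += len(tokens)
    return counts, totals, priors

def predict(model, tokens):
    counts, totals, priors = model
    vocab = {w for c in counts.values() for w in c}
    best, best_lp = None, -math.inf
    for label in priors:
        lp = math.log(priors[label] / sum(priors.values()))
        for w in tokens:
            # Laplace smoothing so unseen words don't zero out the probability.
            lp += math.log((counts[label][w] + 1) / (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb([("pos", ["good", "great"]), ("neg", ["bad", "awful"])])
print(predict(model, ["good"]))  # pos
```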
Extra NLP normalization: lowercase -> strip affixes -> check whether the result is in a dictionary; also stemming -> lemmatization.
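A toy affix-stripping normalizer in the spirit of that pipeline (lowercase, strip suffixes, dictionary check); the dictionary and suffix list are made up, and real work would use NLTK's `PorterStemmer` / `WordNetLemmatizer` instead.

```python
DICTIONARY = {"run", "happy", "cat"}          # toy dictionary
SUFFIXES = ["ning", "ies", "ing", "ed", "s"]  # toy affixes, longest first

def normalize(word):
    word = word.lower()
    if word in DICTIONARY:          # already a dictionary form
        return word
    for suf in SUFFIXES:            # strip affixes until a dictionary word appears
        if word.endswith(suf) and word[:-len(suf)] in DICTIONARY:
            return word[:-len(suf)]
    return word                     # fall back to the raw token

print(normalize("Running"))  # run
print(normalize("cats"))     # cat
```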
(How to handle hashtags/exclamation marks?)
2) Word embedding based: https://medium.com/@thoszymkowiak/how-to-implement-sentiment-analysis-using-word-embedding-and-convolutional-neural-networks-on-keras-163197aef623
https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
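The linked posts explain the idea: embeddings map each word to a dense vector so that similar words end up close together, typically compared by cosine similarity. A minimal sketch with made-up 3-d vectors (real ones come from word2vec/GloVe, e.g. via gensim):

```python
import math

# Toy embedding table; the vectors are invented for illustration only.
emb = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.7, 0.1, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# "good" should be closer to "great" than to "bad".
print(cosine(emb["good"], emb["great"]) > cosine(emb["good"], emb["bad"]))  # True
```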
Can use edit distance between words to account for word augmentation by Twitter users (e.g. elongated spellings like "coool").
Example: https://en.m.wikipedia.org/wiki/Levenshtein_distance
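A minimal dynamic-programming implementation of the Levenshtein distance linked above, which could map elongated Twitter spellings back to dictionary words:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("coool", "cool"))      # 1
print(levenshtein("kitten", "sitting"))  # 3
```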
word count model: counting occurrences of words in tweets
Examples:
Need more good ideas